Brazilian patent BR112019014651A2: methods for sequencing nucleic acid molecules and for preparing sequencing adapters, a computer prog

Abstract:
The described embodiments relate to methods, apparatus, systems, and computer program products for determining sequences of interest using unique molecular index (UMI) sequences that are uniquely associated with individual polynucleotide fragments, including sequences with low allele frequencies and long sequence lengths. In some implementations, the UMI sequences include non-random sequences of variable length. In some implementations, the UMI sequences are associated with the individual polynucleotide fragments based on alignment scores that indicate similarity between the UMI sequences and subsequences of the sequence reads obtained from the individual polynucleotide fragments. Systems, apparatus, and computer program products that implement the described methods to determine a sequence of interest are also provided.
Publication number: BR112019014651A2
Application number: R112019014651-2
Filing date: 2018-01-05
Publication date: 2020-07-21
Inventors: Kevin Wu; Chen Zhao; Han-Yu Chuang; Alex So; Stephen Tanner; Stephen M. Gross
Applicant: Illumina, Inc.
Main IPC class:

Description:

[001] This application claims benefit under 35 U.S.C. § 119(e) of United States Provisional Patent Application No. 62/447,851, entitled: METHODS AND SYSTEMS FOR GENERATION AND ERROR-
[002] Next-generation sequencing technology is providing ever higher sequencing speed, allowing for greater sequencing depth. However, because sequencing accuracy and sensitivity are affected by errors and noise from various sources (for example, sample defects, PCR during library preparation, enrichment, clustering, and sequencing), increasing the sequencing depth alone cannot guarantee the detection of sequences with very low allele frequency, such as in fetal cell-free DNA (cfDNA) in maternal plasma, circulating tumor DNA (ctDNA), and subclonal mutations in pathogens. It is therefore desirable to develop methods for determining sequences of DNA molecules present in small quantities and/or at low allele frequency while suppressing sequencing inaccuracy arising from these sources of error.
SUMMARY
[003] The described implementations relate to methods, apparatus, systems, and computer program products for determining sequences of nucleic acid fragments using unique molecular indices (UMIs). In some implementations, the UMIs include non-random UMIs (NRUMIs) or variable-length non-random unique molecular indices (vNRUMIs).
[004] One aspect of the disclosure provides methods for sequencing nucleic acid molecules from a sample. The method includes: (a) applying adapters to DNA fragments in the sample to obtain DNA-adapter products, where each adapter includes a non-random unique molecular index, and where the non-random unique molecular indices of the adapters have at least two different molecular lengths and form a set of variable-length non-random unique molecular indices (vNRUMIs); (b) amplifying the DNA-adapter products to obtain a plurality of amplified polynucleotides; (c) sequencing the plurality of amplified polynucleotides, thereby obtaining a plurality of reads associated with the set of vNRUMIs; (d) identifying, among the plurality of reads, reads associated with the same variable-length non-random unique molecular index (vNRUMI); and (e) determining a sequence of a DNA fragment in the sample using the reads associated with the same vNRUMI.
[005] In some implementations, identifying the reads associated with the same vNRUMI includes obtaining, for each read of the plurality of reads, alignment scores with respect to the set of vNRUMIs, each alignment score indicating similarity between a subsequence of a read and a vNRUMI, where the subsequence is in a region of the read in which vNRUMI-derived nucleotides are likely to be located.
[006] In some implementations, the alignment scores are based on nucleotide matches and nucleotide edits between the subsequence of the read and the vNRUMI. In some implementations, the nucleotide edits include substitutions, additions, and deletions of nucleotides. In some implementations, each alignment score penalizes mismatches at the beginning of a sequence but does not penalize mismatches at the end of the sequence.
[007] In some implementations, obtaining an alignment score between a read and a vNRUMI includes: (a) calculating an alignment score between the vNRUMI and each of all possible prefix sequences of the subsequence of the read; (b) calculating an alignment score between the subsequence of the read and each of all possible prefix sequences of the vNRUMI; and (c) taking the highest of the alignment scores calculated in (a) and (b) as the alignment score between the read and the vNRUMI.
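The prefix-based scoring in (a) through (c) can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes a Needleman-Wunsch style score of +1 per match and -1 per substitution, insertion, or deletion, and the names `alignment_table` and `prefix_alignment_score` are hypothetical. Both input sequences are assumed to be non-empty.

```python
def alignment_table(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch table: table[i][j] is the best score of a[:i] against b[:j]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        table[i][0] = i * gap
    for j in range(1, len(b) + 1):
        table[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
    return table


def prefix_alignment_score(read_subsequence, umi):
    """Highest score over all prefixes of either sequence (assumed scoring weights)."""
    table = alignment_table(umi, read_subsequence)
    n, m = len(umi), len(read_subsequence)
    # (a) the whole UMI against every non-empty prefix of the read subsequence
    umi_vs_read_prefixes = max(table[n][j] for j in range(1, m + 1))
    # (b) the whole read subsequence against every non-empty prefix of the UMI
    read_vs_umi_prefixes = max(table[i][m] for i in range(1, n + 1))
    # (c) the highest score found in (a) and (b)
    return max(umi_vs_read_prefixes, read_vs_umi_prefixes)
```

Under these assumed weights, a 7-base read subsequence `ACGTACT` scores 6 against the 6-base index `ACGTAC`, because the trailing base is not penalized, whereas an end-to-end global alignment of the same pair would score 5. A mismatch at the first base (`TCGTACT`) still lowers the score, to 4.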
[008] In some implementations, the subsequence has a length equal to the length of the longest vNRUMI in the set of vNRUMIs. In some implementations, identifying the reads associated with the same vNRUMI in (d) additionally includes: selecting, for each read of the plurality of reads, at least one vNRUMI from the set of vNRUMIs based on the alignment scores; and associating each read of the plurality of reads with the at least one vNRUMI selected for that read.
[009] In some implementations, selecting the at least one vNRUMI from the set of vNRUMIs includes selecting the vNRUMI having the highest alignment score in the set of vNRUMIs. In some implementations, the at least one vNRUMI includes two or more vNRUMIs.
[0010] In some implementations, the method additionally includes selecting one of the two or more vNRUMIs as the same vNRUMI of (d) and (e).
[0011] In some implementations, the adapters applied in (a) are obtained by: (i) providing a set of oligonucleotide sequences having at least two different molecular lengths; (ii) selecting a subset of oligonucleotide sequences from the set of oligonucleotide sequences, all edit distances between oligonucleotide sequences of the subset meeting a threshold value, the subset of oligonucleotide sequences forming the set of vNRUMIs; and (iii) synthesizing the adapters, each comprising a hybridized double-stranded region, a single-stranded 5' arm, a single-stranded 3' arm, and at least one vNRUMI of the set of vNRUMIs. In some implementations, the threshold value is 3. In some implementations, the set of vNRUMIs includes 6-nucleotide vNRUMIs and 7-nucleotide vNRUMIs.
[0012] In some implementations, the determining of (e) includes collapsing the reads associated with the same vNRUMI into a group to obtain a consensus nucleotide sequence for the sequence of the DNA fragment in the sample. In some implementations, the consensus nucleotide sequence is obtained based in part on quality scores of the reads.
[0013] In some implementations, the determining of (e) includes: identifying, among the reads associated with the same vNRUMI, reads having the same read position or similar read positions in a reference sequence, and determining the sequence of the DNA fragment using reads that (i) are associated with the same vNRUMI and (ii) have the same read position or similar read positions in the reference sequence.
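Collapsing reads that share a vNRUMI and a read position into a consensus can be sketched as below. This is only an illustrative sketch under stated assumptions: reads are given as hypothetical `(umi, position, sequence, qualities)` tuples, "same position" means exact equality, and the consensus base at each column is the base with the largest summed quality score. The function name `collapse_reads` is not from the source.

```python
from collections import defaultdict


def collapse_reads(reads):
    """Group reads by (UMI, position) and build a quality-weighted consensus per group.

    reads: iterable of (umi, position, sequence, qualities) tuples, where
    qualities is a list of per-base quality scores (hypothetical input format).
    """
    families = defaultdict(list)
    for umi, position, sequence, qualities in reads:
        families[(umi, position)].append((sequence, qualities))

    consensus = {}
    for key, members in families.items():
        length = min(len(sequence) for sequence, _ in members)
        bases = []
        for i in range(length):
            weight = defaultdict(int)
            for sequence, qualities in members:
                weight[sequence[i]] += qualities[i]
            # Ties are broken alphabetically so the result is deterministic.
            bases.append(max(sorted(weight), key=weight.get))
        consensus[key] = "".join(bases)
    return consensus


example_reads = [
    ("ACGTAC", 100, "ACGT", [30, 30, 30, 30]),
    ("ACGTAC", 100, "ACGT", [30, 30, 30, 30]),
    ("ACGTAC", 100, "ACTT", [30, 30, 10, 30]),  # low-quality error at the third base
    ("TTGCAA", 100, "GGGG", [30, 30, 30, 30]),  # different UMI, separate family
]
print(collapse_reads(example_reads))
```

In this toy input, the low-quality `T` in the third read is outvoted, so the family keyed by `("ACGTAC", 100)` collapses to `ACGT`, while the read carrying a different index forms its own family.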
[0014] In some implementations, the set of vNRUMIs includes no more than about 10,000 different vNRUMIs. In some implementations, the set of vNRUMIs includes no more than about
[0015] In some implementations, applying adapters to the DNA fragments in the sample includes applying adapters to both ends of the DNA fragments in the sample.
[0016] Another aspect of the disclosure relates to methods for preparing sequencing adapters, the methods including: (a) providing a set of oligonucleotide sequences having at least two different molecular lengths; (b) selecting a subset of oligonucleotide sequences from the set of oligonucleotide sequences, all edit distances between oligonucleotide sequences of the subset meeting a threshold value, the subset of oligonucleotide sequences forming a set of variable-length non-random unique molecular indices (vNRUMIs); and (c) synthesizing a plurality of sequencing adapters, each sequencing adapter including a hybridized double-stranded region, a single-stranded 5' arm, a single-stranded 3' arm, and at least one vNRUMI from the set of vNRUMIs.
[0017] In some implementations, (b) includes: (i) selecting an oligonucleotide sequence from the set of oligonucleotide sequences; (ii) adding the selected oligonucleotide sequence to an expansion set of oligonucleotide sequences and removing it from the set of oligonucleotide sequences to obtain a reduced set of oligonucleotide sequences; (iii) selecting a present oligonucleotide sequence from the reduced set that maximizes a distance function, where the distance function is the minimum edit distance between the present oligonucleotide sequence and any oligonucleotide sequence in the expansion set, and where the distance function meets the threshold value; (iv) adding the present oligonucleotide sequence to the expansion set and removing it from the reduced set; (v) repeating (iii) and (iv) one or more times; and (vi) providing the expansion set as the subset of oligonucleotide sequences forming the set of vNRUMIs.
[0018] In some implementations, (v) includes repeating (iii) and (iv) until the distance function no longer meets the threshold value.
[0019] In some implementations, (v) includes repeating (iii) and (iv) until the expansion set reaches a defined size.
[0020] In some implementations, the present oligonucleotide sequence or an oligonucleotide sequence in the expansion set is shorter than the longest oligonucleotide sequence in the set of oligonucleotide sequences, and the method additionally includes, before (iii): (1) appending a thymine base, or a thymine base plus any of the four bases, to the present oligonucleotide sequence or the oligonucleotide sequence in the expansion set, thereby generating a padded sequence having the same length as the longest oligonucleotide sequence in the set of oligonucleotide sequences; and (2) using the padded sequence to calculate the minimum edit distance. In some implementations, the edit distances are Levenshtein distances. In some implementations, the threshold value is 3.
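The greedy selection of steps (i) through (vi) can be sketched roughly as follows. The sketch makes several assumptions that are not in the source: the seed in step (i) is simply the alphabetically first candidate, padding handles only sequences that are exactly one base shorter than the longest (a single thymine is appended), and the names `levenshtein`, `pad`, and `greedy_select` are hypothetical.

```python
from itertools import product


def levenshtein(a, b):
    """Levenshtein distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def pad(seq, longest):
    """Simplified padding: append one thymine to a sequence one base short of the longest."""
    return seq + "T" if len(seq) == longest - 1 else seq


def greedy_select(pool, threshold=3, max_size=None):
    """Greedily grow a set whose pairwise padded edit distances all meet the threshold."""
    longest = max(len(s) for s in pool)
    remaining = sorted(pool)
    chosen = [remaining.pop(0)]  # steps (i)-(ii): arbitrary seed (assumption)
    # Minimum distance from each remaining candidate to the chosen set so far.
    min_dist = {s: levenshtein(pad(s, longest), pad(chosen[0], longest)) for s in remaining}
    while remaining and (max_size is None or len(chosen) < max_size):
        best = max(remaining, key=min_dist.get)  # step (iii): maximize the distance function
        if min_dist[best] < threshold:
            break  # the distance function no longer meets the threshold
        chosen.append(best)  # step (iv)
        remaining.remove(best)
        for s in remaining:
            d = levenshtein(pad(s, longest), pad(best, longest))
            if d < min_dist[s]:
                min_dist[s] = d
    return chosen  # step (vi)


# Small demonstration pool of 3- and 4-base sequences (the text describes 6 and 7).
demo_pool = ["".join(p) for n in (3, 4) for p in product("ACGT", repeat=n)]
demo_set = greedy_select(demo_pool, threshold=2)
```

The demonstration uses short sequences only so that it runs quickly; passing a pool of 6-base and 7-base sequences with `threshold=3` follows the same path, only more slowly. The two stopping rules of the text correspond to the `threshold` check and the optional `max_size` argument.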
[0021] In some implementations, the method additionally includes, before (b), removing certain oligonucleotide sequences from the set of oligonucleotide sequences to obtain a filtered set of oligonucleotide sequences, and providing the filtered set as the set of oligonucleotide sequences from which the subset is selected.
[0022] In some implementations, the certain oligonucleotide sequences include oligonucleotide sequences having three or more consecutive identical bases. In some implementations, the certain oligonucleotide sequences include oligonucleotide sequences having a combined number of guanine and cytosine bases less than 2 and oligonucleotide sequences having a combined number of guanine and cytosine bases greater than 4.
[0023] In some implementations, the certain oligonucleotide sequences include oligonucleotide sequences having the same base in the last two positions. In some implementations, the certain oligonucleotide sequences include oligonucleotide sequences having a subsequence that matches the 3' end of one or more sequencing primers.
[0024] In some implementations, the certain oligonucleotide sequences include oligonucleotide sequences having a thymine base in the last position of the oligonucleotide sequences.
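The removal criteria listed above can be sketched as a single predicate. The text presents each criterion as belonging to "some implementations", so applying all of them together is an assumption of this sketch, as are the function name `passes_filters` and the `primer_3prime_ends` parameter (the source does not say how long the matched primer end is or which primers are used).

```python
import re


def passes_filters(seq, primer_3prime_ends=()):
    """Return True if a candidate sequence survives every removal criterion."""
    if re.search(r"(.)\1\1", seq):  # three or more consecutive identical bases
        return False
    gc = seq.count("G") + seq.count("C")
    if gc < 2 or gc > 4:  # combined G and C count outside the 2 to 4 range
        return False
    if len(seq) >= 2 and seq[-1] == seq[-2]:  # same base in the last two positions
        return False
    if seq.endswith("T"):  # thymine in the last position
        return False
    if any(end in seq for end in primer_3prime_ends):  # matches a primer's 3' end
        return False
    return True


candidates = ["ACGTAC", "AAACGC", "ATATAC", "ACGTCC", "ACGTAT"]
print([s for s in candidates if passes_filters(s)])
```

Of the five toy candidates, only `ACGTAC` survives: the others fail on a homopolymer run, low GC count, a repeated final base, and a terminal thymine, respectively.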
[0025] In some implementations, the set of vNRUMIs includes 6-nucleotide vNRUMIs and 7-nucleotide vNRUMIs.
[0026] An additional aspect of the disclosure relates to a method for sequencing nucleic acid molecules in a sample, including: (a) applying adapters to DNA fragments in the sample to obtain DNA-adapter products, where each adapter includes a non-random unique molecular index, and where the non-random unique molecular indices of the adapters have at least two different molecular lengths and form a set of variable-length non-random unique molecular indices (vNRUMIs); (b) amplifying the DNA-adapter products to obtain a plurality of amplified polynucleotides; (c) sequencing the plurality of amplified polynucleotides, thereby obtaining a plurality of reads associated with the set of vNRUMIs; and (d) identifying, among the plurality of reads, reads associated with the same variable-length non-random unique molecular index (vNRUMI).
[0027] In some implementations, the method additionally includes obtaining a count of the reads associated with the same vNRUMI.
[0028] Another aspect of the disclosure relates to a method for sequencing nucleic acid molecules in a sample, including: (a) applying adapters to DNA fragments in the sample to obtain DNA-adapter products, where each adapter includes a unique molecular index (UMI), and where the unique molecular indices (UMIs) of the adapters have at least two different molecular lengths and form a set of variable-length unique molecular indices (vUMIs); (b) amplifying the DNA-adapter products to obtain a plurality of amplified polynucleotides; (c) sequencing the plurality of amplified polynucleotides, thereby obtaining a plurality of reads associated with the set of vUMIs; and (d) identifying, among the plurality of reads, reads associated with the same variable-length unique molecular index (vUMI).
[0029] In some implementations, the method additionally includes determining a sequence of a DNA fragment in the sample using the reads associated with the same vUMI.
[0030] In some implementations, the method additionally includes obtaining a count of the reads associated with the same vUMI.
[0031] Yet another aspect of the disclosure relates to a method for sequencing nucleic acid molecules in a sample, including: (a) applying adapters to DNA fragments in the sample to obtain DNA-adapter products, where each adapter includes a unique molecular index (UMI) in a set of unique molecular indices (UMIs); (b) amplifying the DNA-adapter products to obtain a plurality of amplified polynucleotides; (c) sequencing the plurality of amplified polynucleotides, thereby obtaining a plurality of reads associated with the set of UMIs; (d) obtaining, for each read of the plurality of reads, alignment scores with respect to the set of UMIs, each alignment score indicating similarity between a subsequence of a read and a UMI; (e) identifying, among the plurality of reads, reads associated with the same UMI using the alignment scores; and (f) determining a sequence of a DNA fragment in the sample using the reads associated with the same UMI.
[0032] In some implementations, the alignment scores are based on nucleotide matches and nucleotide edits between the subsequence of the read and the UMI. In some implementations, each alignment score penalizes mismatches at the beginning of a sequence but does not penalize mismatches at the end of the sequence. In some implementations, the set of UMIs includes UMIs of at least two different molecular lengths.
[0033] Systems, apparatus, and computer program products that implement the described methods to determine sequences of DNA fragments are also provided.
[0034] One aspect of the disclosure provides a computer program product including a non-transitory machine-readable medium storing program code that, when executed by one or more processors of a computer system, causes the computer system to implement a method for determining sequence information of a sequence of interest in a sample using unique molecular indices (UMIs). The program code includes instructions for performing the methods above.
[0035] Although the examples here concern humans and the language is directed primarily to human concerns, the concepts described here are applicable to nucleic acids of any virus, plant, animal, or other organism, and to populations thereof (metagenomes, viral populations, etc.). These and other features of the present disclosure will become more fully apparent from the following description, with reference to the figures,
[0036] All patents, patent applications, and other publications, including all sequences disclosed within these references, referred to herein are expressly incorporated herein by reference, to the same extent as if each individual publication, patent, or patent application were specifically and individually indicated to be incorporated by reference. All documents cited are, in relevant part, incorporated herein by reference in their entireties for the purposes indicated by the context of their citation herein. However, the citation of any document is not to be construed as an admission that it is prior art with respect to the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
[0037] Figure 1A is a flow chart illustrating an exemplary workflow using UMIs to sequence nucleic acid fragments.
[0038] Figure 1B shows a DNA fragment/molecule and the adapters used in the initial steps of the workflow shown in Figure 1A.
[0039] Figure 1C is a block diagram showing a process for sequencing DNA fragments using vNRUMIs to suppress errors.
[0040] Figure 1D illustrates a process 140 for making sequencing adapters having vNRUMIs.
[0041] Figure 1E shows examples of how a subsequence of a read, or query sequence (Q), can be compared to two reference sequences (S1 and S2) in the vNRUMI set.
[0042] Figure 1F illustrates examples of how glocal alignment scores can provide better error suppression than global alignment scores.
[0043] Figure 2A schematically illustrates five different adapter designs that can be adopted in the various implementations.
[0044] Figure 2B illustrates a hypothetical process in which UMI jumping occurs in a PCR reaction involving adapters having two physical UMIs on two arms.
[0045] Figure 2C shows data contrasting the read quality scores of sequence reads using NRUMIs versus a control condition.
[0046] Figures 3A and 3B are diagrams showing the materials and reaction products of ligating adapters to double-stranded fragments according to some methods described here.
[0047] Figures 4A-4E illustrate how methods as described here can suppress different sources of error in determining the sequence of a double-stranded DNA fragment.
[0048] Figure 5 schematically illustrates the application of physical UMIs and virtual UMIs to efficiently obtain long paired-end reads.
[0049] Figure 6 is a block diagram of a distributed system for processing a test sample.
[0050] Figure 7 illustrates a computer system that can serve as a computational apparatus according to certain embodiments.
DETAILED DESCRIPTION
[0051] The disclosure relates to methods, apparatus, systems, and computer program products for sequencing nucleic acids, especially nucleic acids present in limited quantity or at low concentration, such as fetal cfDNA in maternal plasma or circulating tumor DNA (ctDNA) in the blood of a cancer patient.
[0052] Numeric ranges are inclusive of the numbers that define the range. It is intended that any maximum numerical limitation given throughout this specification includes any lower numerical limitation, as if such lower numerical limitations were expressly written here. Any minimum numerical limitation given throughout this specification will include any upper numerical limitation, as if such upper numerical limitations were expressly written here. Any numerical range given throughout this specification will include any narrower numerical range that falls within such a wider numerical range, as if such narrower numerical ranges were all expressly written here.
[0053] The headings provided here are not intended to limit the disclosure.
[0054] Unless otherwise specified here, all technical and scientific terms used here have the same meaning as commonly understood by one of ordinary skill in the art. Various scientific dictionaries that include the terms used here are well known and available to those skilled in the art. Although any methods and materials similar or equivalent to those described here find use in the practice or testing of the embodiments described here, some methods and materials are described.
[0055] The terms defined immediately below are more fully described by reference to the Specification as a whole. It is to be understood that this disclosure is not limited to the particular methodology, protocols, and reagents described, as these may vary depending on the context in which they are used by those skilled in the art.
Definitions
[0056] As used here, the singular terms "a," "an," and "the" include the plural reference unless the context clearly indicates otherwise.
[0057] Unless otherwise indicated, nucleic acids are written left to right in 5' to 3' orientation and amino acid sequences are written left to right in amino to carboxy orientation, respectively.
[0058] Unique molecular indices (UMIs) are sequences of nucleotides applied to or identified in DNA molecules that can be used to distinguish individual DNA molecules from one another. Since UMIs are used to identify DNA molecules, they are also referred to as unique molecular identifiers. See, for example, Kivioja, Nature Methods 9, 72-74 (2012). UMIs can be sequenced together with the DNA molecules with which they are associated to determine whether the read sequences are those of one source DNA molecule or another. The term "UMI" is used here to refer to both the sequence information of a polynucleotide and the physical polynucleotide per se.
[0059] Commonly, multiple instances of a single source molecule are sequenced. In the case of sequencing by synthesis using Illumina's sequencing technology, the source molecule can be PCR amplified before delivery to a flow cell. Whether or not it is PCR amplified, the individual DNA molecules applied to the flow cell are bridge amplified or ExAmp amplified to produce a cluster. Each molecule in a cluster derives from the same source DNA molecule but is sequenced separately. For error correction and other purposes, it can be important to determine that all reads from a single cluster are identified as deriving from the same source molecule. UMIs allow this grouping. A DNA molecule that is copied by amplification or otherwise to produce multiple instances of the DNA molecule is referred to as a source DNA molecule.
[0060] In addition to errors associated with the source DNA molecules, errors can also occur in a region associated with the UMIs. In some implementations, the latter type of error can be corrected by mapping a read sequence to the most likely UMI within a pool of UMIs.
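One simple way to picture this kind of correction is a nearest-neighbor lookup, sketched below. It is a stand-in under stated assumptions rather than the alignment-score method described in this document: "most likely" is taken to mean the smallest Levenshtein distance, ambiguous or distant matches are rejected, and `correct_umi` and its `max_distance` parameter are hypothetical names.

```python
def levenshtein(a, b):
    """Levenshtein distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def correct_umi(observed, pool, max_distance=1):
    """Return the unique closest UMI in the pool, or None if too far or ambiguous."""
    scored = sorted((levenshtein(observed, umi), umi) for umi in pool)
    best_distance, best_umi = scored[0]
    if best_distance > max_distance:
        return None  # nothing in the pool is close enough
    if len(scored) > 1 and scored[1][0] == best_distance:
        return None  # two pool members are equally close
    return best_umi


umi_pool = ["ACGTAC", "TTGCAA", "GACTTG"]
print(correct_umi("ACGAAC", umi_pool))  # one substitution away from ACGTAC
print(correct_umi("CCCCCC", umi_pool))  # too far from every pool member
```

A pool whose members are separated by an edit distance of at least 3 guarantees that a single-base error can never be equally close to two members, which is why the tie check rarely triggers in a well-designed set.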
[0061] UMIs are similar to barcodes, which are commonly used to distinguish reads of one sample from reads of other samples, but UMIs are instead used to distinguish one source DNA molecule from another when many DNA molecules are sequenced together. Because there can be many more DNA molecules in a sample than samples in a sequencing run, there are typically many more distinct UMIs than distinct barcodes in a sequencing run.
[0062] As mentioned, UMIs can be applied to or identified in individual DNA molecules. In some implementations, UMIs can be applied to DNA molecules by methods that physically link or bond the UMIs to the DNA molecules, for example, by ligation or transposition through polymerase, endonuclease, transposases, etc. These "applied" UMIs are therefore also referred to as physical UMIs. In some contexts, they can also be referred to as exogenous UMIs. UMIs identified within source DNA molecules are referred to as virtual UMIs. In some contexts, virtual UMIs can also be referred to as endogenous UMIs.
[0063] Physical UMIs can be defined in many ways. For example, they can be random, pseudo-random, partially random, or non-random nucleotide sequences that are inserted in adapters or otherwise incorporated in source DNA molecules to be sequenced. In some implementations, the physical UMIs can be so unique that each is expected to uniquely identify any given source DNA molecule present in a sample. A collection of adapters is generated, each having a physical UMI, and those adapters are attached to fragments or other source DNA molecules to be sequenced, so that the individual sequenced molecules each have a UMI that helps distinguish it from all other fragments. In such implementations, a very large number of different physical UMIs (for example, many thousands to millions) can be used to uniquely identify DNA fragments in a sample.
[0064] Naturally, the physical UMI must be long enough to ensure this uniqueness for each and every source DNA molecule. In some implementations, a less unique molecular identifier can be used in conjunction with other identification techniques to ensure that each source DNA molecule is uniquely identified during the sequencing process. In such implementations, multiple fragments or adapters can have the same physical UMI. Other information, such as alignment location or virtual UMIs, can be combined with the physical UMI to uniquely identify reads as being derived from a single source DNA fragment/molecule. In some implementations, adapters include physical UMIs limited to a relatively small number of non-random sequences, for example, 120 non-random sequences. Such physical UMIs are also referred to as non-random UMIs. In some implementations, non-random UMIs can be combined with sequence position information and/or virtual UMIs to identify reads attributable to the same source DNA molecule. The identified reads can be combined to obtain a consensus sequence that reflects the sequence of the source DNA molecule as described here. Using physical UMIs, virtual UMIs, and/or alignment locations, one can identify reads having the same or related UMIs or locations; the identified reads can then be combined to obtain one or more consensus sequences. The process of combining reads to obtain a consensus sequence is also referred to as "collapsing" reads, which is further described hereinafter.
[0065] A "virtual unique molecular index" or "virtual UMI" is a unique subsequence in a source DNA molecule. In some implementations, virtual UMIs are located at or near the ends of the source DNA molecule. One or more such unique end positions can, alone or in conjunction with other information, uniquely identify a source DNA molecule. Depending on the number of distinct source DNA molecules and the number of nucleotides in the virtual UMI, one or more virtual UMIs can uniquely identify source DNA molecules in a sample. In some cases, a combination of two virtual unique molecular identifiers is required to identify a source DNA molecule. Such combinations can be extremely rare, possibly found only once in a sample. In some cases, one or more virtual UMIs in combination with one or more physical UMIs can together uniquely identify a source DNA molecule.
[0066] A "random UMI" can be considered a physical UMI selected as a random sample, with or without replacement, from a set of UMIs consisting of all possible different oligonucleotide sequences given one or more sequence lengths. For example, if each UMI in the set of UMIs has n nucleotides, then the set includes 4^n UMIs having sequences that are different from each other. A random sample selected from the 4^n UMIs constitutes a random UMI.
[0067] Conversely, a "non-random UMI" (NRUMI) as used here refers to a physical UMI that is not a random UMI. In some embodiments, non-random UMIs are predefined for a particular experiment or application. In certain embodiments, rules are used to generate sequences for a set or to select a sample from the set to obtain a non-random UMI. For example, the sequences of a set can be generated such that the sequences have a particular pattern or patterns. In some implementations, each sequence differs from every other sequence in the set by a particular number of (for example, 2, 3, or 4) nucleotides. That is, no non-random UMI sequence can be converted to any other available non-random UMI sequence by replacing fewer than the particular number of nucleotides. In some implementations, a set of NRUMIs used in a sequencing process includes fewer than all possible UMIs given a particular sequence length. For example, a set of NRUMIs having 6 nucleotides can include a total of 96 different sequences, instead of a total of 4^6 = 4096 possible different sequences.
[0068] In some implementations where non-random UMIs are selected from a set with fewer than all possible different sequences, the number of non-random UMIs is smaller, sometimes significantly so, than the number of source DNA molecules. In such implementations, non-random UMI information can be combined with other information, such as virtual UMIs, read locations on a reference sequence, and/or sequence information of reads, to identify sequence reads deriving from the same source DNA molecule.
[0069] The term "variable-length non-random molecular index" (vNRUMI) refers to a UMI in a set of vNRUMIs selected from a pool of UMIs of variable molecular lengths (or heterogeneous length) using a non-random selection process. The term vNRUMI is used to refer to both the UMI molecule and the UMI sequence. In some implementations, certain UMIs can be removed from the pool of UMIs to provide a filtered pool of UMIs, which pool is then used to generate the set of vNRUMIs.
[0070] In some implementations, each vNRUMI differs from every other vNRUMI in the set used in a process by at least a defined edit distance. In some implementations, a set of vNRUMIs used in a sequencing process includes fewer than all possible UMIs given the relevant molecular lengths. For example, a set of vNRUMIs having 6 and 7 nucleotides can include a total of 120 different sequences (instead of a total of 4^6 + 4^7 = 20480 possible different sequences). In other implementations, sequences are not selected randomly from a set. Instead, some sequences are selected with higher probability than others.
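The pool sizes quoted above can be checked by direct enumeration. The short sketch below is only a sanity check of the arithmetic; the helper name `all_sequences` is hypothetical.

```python
from itertools import product


def all_sequences(length, alphabet="ACGT"):
    """Enumerate every possible nucleotide sequence of the given length."""
    return ["".join(p) for p in product(alphabet, repeat=length)]


six_mers = all_sequences(6)
pool = six_mers + all_sequences(7)
print(len(six_mers))  # 4**6 = 4096
print(len(pool))      # 4**6 + 4**7 = 20480
```

A set of 120 vNRUMIs therefore uses well under 1% of the 20480 candidate sequences, which is what leaves room to enforce a minimum edit distance between members.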
[0071] [0071] The term "molecular length" is also referred to as sequence length, and can be measured in nucleotides. The term molecular length is also used interchangeably with the terms molecular size, DNA size and sequence length.
[0072] [0072] Editing distance is a metric quantification of how dissimilar two strings (for example, words) are from each other by counting the minimum number of operations required to transform one chain into the other. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be seen as chains of the letters A, C, Ge T.
[0073] [0073] Different forms of editing distance use different sets of chain operations. Levenshtein distance is a common type of editing distance. Levenshtein distance chain operations consider numbers of deletions, insertions, and character substitutions in the chain. In some implementations, other variants of editing distances can be used. For example, other variants of editing distance can be obtained by restricting the set of operations. Longest common subsequence distance (LCS) is editing distance with insertion and deletion as the only two editing operations, both at unit cost. Similarly, only by allowing substitutions, Hamming distance is obtained, which is restricted to chains of equal length. Distance from Jaro — Winkler can be obtained from an editing distance where only transpositions are allowed.
[0074] [0074] In some implementations, different string operations can be weighted differently for an edit distance. For example, a substitution operation can be weighted at a value of 3, while an indel can be weighted at a value of 2. In some implementations, matches of different types may be weighted differently. For example, an A-A match may have twice the weight of a G-G match.
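A weighted edit distance of this kind can be sketched as follows. The default costs (substitution 3, indel 2) follow the example in the text; the function name and the dynamic-programming formulation are assumptions made here for illustration.

```python
def weighted_edit_distance(a: str, b: str, sub_cost: int = 3, indel_cost: int = 2) -> int:
    """Minimum total cost of turning `a` into `b` when substitutions and
    indels (insertions or deletions) carry different costs."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                         # delete ca
                curr[j - 1] + indel_cost,                     # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

With these costs a single substitution (cost 3) is cheaper than a deletion followed by an insertion (cost 4). If the substitution cost is raised above twice the indel cost, the indel route becomes the cheaper one.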
[0075] [0075] An alignment score is a score that indicates the similarity of two sequences determined using an alignment method. In some implementations, an alignment score considers the number of edits (for example, deletions, insertions, and substitutions of characters in the string). In some implementations, an alignment score considers the number of matches. In some implementations, an alignment score considers both the number of matches and the number of edits. In some implementations, the numbers of matches and edits are weighted equally in the alignment score. For example, an alignment score can be calculated as: number of matches - number of insertions - number of deletions - number of substitutions. In other implementations, the numbers of matches and edits can be weighted differently. For example, an alignment score can be calculated as: number of matches x 5 - number of insertions x 4 - number of deletions x 4 - number of substitutions x 6.
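The two example formulas above can be written as a single function with configurable weights. This is a minimal sketch; the names `ScoreWeights` and `alignment_score` are hypothetical, and the counts of matches and edits are assumed to come from an alignment computed elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreWeights:
    """Weights for matches and for each edit type (all 1 = equal weighting)."""
    match: int = 1
    insertion: int = 1
    deletion: int = 1
    substitution: int = 1


def alignment_score(matches, insertions, deletions, substitutions,
                    weights=ScoreWeights()):
    """Weighted matches minus weighted edits."""
    return (matches * weights.match
            - insertions * weights.insertion
            - deletions * weights.deletion
            - substitutions * weights.substitution)


# Weights from the second example in the text: 5, 4, 4, 6.
EXAMPLE_WEIGHTS = ScoreWeights(match=5, insertion=4, deletion=4, substitution=6)
```

For an alignment with 5 matches, 1 deletion, and 1 substitution, the equally weighted score is 5 - 0 - 1 - 1 = 3, and the score under `EXAMPLE_WEIGHTS` is 25 - 0 - 4 - 6 = 15.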
[0076] [0076] The term “paired end readings” refers to readings
[0077] [0077] As used here, the terms "alignment" and "align" refer to the process of comparing a reading to a reference sequence and thus determining whether the reference sequence contains the reading sequence. An alignment process, as used here, attempts to determine whether a reading can be mapped to a reference sequence, but does not always result in a reading aligned to the reference sequence. If the reference sequence contains the reading, the reading can be mapped to the reference sequence or, in certain embodiments, to a particular location in the reference sequence. In some cases, alignment simply says whether a reading is a member of a particular reference sequence or not (that is, whether the reading is present or absent in the reference sequence). For example, aligning a reading to the reference sequence for human chromosome 13 will tell whether the reading is present in the reference sequence for chromosome 13.
[0078] [0078] Naturally, alignment tools have many additional aspects and many other applications in bioinformatics that are not described in this application. For example, alignments can also be used to determine how similar two DNA sequences from two different species are, thus providing a measure of how closely related they are in an evolutionary tree.
[0079] [0079] In some implementations here, alignment is performed between a subsequence of a reading and a vNRUMI as a reference sequence to determine an alignment score, as further described hereinafter. Alignment scores between a reading and multiple vNRUMIs can then be used to determine which of the vNRUMIs the reading should be associated with or mapped to.
[0080] [0080] In some cases, an alignment additionally indicates a location in the reference sequence to which the reading maps. For example, if the reference sequence is the entire human genome sequence, an alignment may indicate that a reading is present on chromosome 13, and may additionally indicate that the reading is on a particular strand and/or site of chromosome 13. In some scenarios, alignment tools are imperfect, in that a) not all valid alignments are found, and b) some alignments obtained are invalid. This happens for several reasons; for example, readings may contain errors, and sequenced readings may differ from the reference genome due to differences in haplotypes. In some applications, alignment tools include built-in mismatch tolerance, which tolerates certain degrees of base pair mismatch and still allows alignment of readings to a reference sequence. This can help to identify valid alignments of readings that would otherwise be missed.
[0081] [0081] Aligned readings are one or more sequences that are identified as a match, in terms of the order of their nucleic acid molecules, to a known reference sequence such as a reference genome. An aligned reading and its determined location in the reference sequence constitute a sequence marker. Alignment can be done manually, although it is typically implemented by a computer algorithm, as it would be impossible to align readings manually within a reasonable period of time to implement the methods described here. An example of a sequence alignment algorithm is the global-local (glocal) hybrid alignment method for comparing a reading prefix sequence to a vNRUMI, as further described hereinafter. Another example of an alignment method is the computer program Efficient Local Alignment of Nucleotide Data (ELAND), distributed as part of the Illumina Genomics Analysis pipeline. Alternatively, a Bloom filter or similar set membership tester can be employed to align readings to reference genomes. See US Patent Application No. 14/354,528, filed April 25, 2014, which is incorporated herein by reference in its entirety. The match of a sequence reading in the alignment can be a 100% sequence match or less than 100% (i.e., a non-perfect match). Additional alignment methods are described in U.S. Patent Application No. 15/130,668 (Attorney Reference ILMNPO008), filed on April 15, 2016, which is incorporated by reference in its entirety.
[0082] [0082] The term “mapping” used here refers to the assignment of a reading sequence to a larger sequence, for example, a reference genome, by alignment.
[0083] [0083] The terms "polynucleotide", "nucleic acid" and "nucleic acid molecules" are used interchangeably and refer to a covalently linked sequence of nucleotides (i.e., ribonucleotides for RNA and deoxyribonucleotides for DNA) in which the 3' position of the pentose of one nucleotide is joined by a phosphodiester group to the 5' position of the pentose of the next. The nucleotides include sequences of any form of nucleic acid, including, but not limited to, RNA and DNA molecules such as cell-free DNA (cfDNA) molecules. The term "polynucleotide" includes, without limitation, single- and double-stranded polynucleotides.
[0084] [0084] The term "test sample" refers here to a sample, typically derived from a biological fluid, cell, tissue, organ, or organism, which includes a nucleic acid or mixture of nucleic acids having at least one nucleic acid sequence that is to be screened for copy number variation and other genetic changes, such as, but not limited to, single nucleotide polymorphisms, insertions, deletions, and structural variations. In certain embodiments, the sample has at least one nucleic acid sequence whose copy number is suspected to have varied. Such samples include, but are not limited to, sputum/oral fluid, amniotic fluid, blood, a blood fraction, fine needle biopsy samples, urine, peritoneal fluid, pleural fluid, and the like. Although the sample is often taken from a human subject (for example, a patient), the assays can be used for samples from any mammal, including, but not limited to, dogs, cats, horses, goats, sheep, cattle, pigs, etc., as well as mixed populations, such as microbial populations in the wild, or viral populations from patients. The sample can be used directly as obtained from the biological source or following pre-treatment to modify the character of the sample. For example, such a pre-treatment may include preparing plasma from blood, diluting viscous fluids, and so on. Pre-treatment methods can also involve, but not
[0085] [0085] The term “next generation sequencing (NGS)” here refers to sequencing methods that allow for massively parallel sequencing of clonally amplified molecules and of single nucleic acid molecules. Non-limiting examples of NGS include sequencing by synthesis using reversible dye terminators, and sequencing by ligation.
[0086] [0086] The term "reading" refers to a sequence reading of a portion of a nucleic acid sample. Typically, although not necessarily, a reading represents a short sequence of contiguous base pairs in the sample. The reading can be represented symbolically by the base pair sequence in A, T, C, and G of the sample portion, along with a probabilistic estimate of the correctness of each base (quality score). It can be stored on a memory device and processed as appropriate to determine whether it matches a reference sequence or meets other criteria. A reading can be obtained directly from a sequencing device or indirectly from stored sequence information for the sample. In some cases, a reading is a DNA sequence of sufficient length (for example, at least
[0087] [0087] The terms "site" and "alignment site" are used interchangeably to refer to a unique position (i.e., chromosome ID, chromosome position, and orientation) in a reference genome. In some embodiments, a site can be the position of a residue, a sequence marker, or a segment in a reference sequence.
[0088] [0088] As used herein, the term "reference genome" or "reference sequence" refers to any particular known genetic sequence, whether partial or complete, of any organism or virus that can be used to reference identified sequences from a subject. For example, a reference genome used for human subjects as well as many other organisms is found at the National Center for Biotechnology Information at ncbi.nlm.nih.gov. A "genome" refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences. However, it is understood that “complete” is a relative concept, since even the gold standard reference genome is expected to include gaps and errors.
[0089] [0089] In some implementations, a vNRUMI sequence can be used as a reference sequence to which a prefix sequence of a reading is aligned. Alignment provides an alignment score between the reading prefix sequence and the vNRUMI, which can be used to determine whether the reading and the vNRUMI should be associated in a process to collapse the readings associated with the same vNRUMI.
[0090] [0090] In various embodiments, the reference sequence is significantly longer than the readings that are aligned to it. For example, it can be at least about 100 times longer, or at least
[0091] [0091] In one example, the reference sequence is that of a full-length human genome. Such sequences can be referred to as genomic reference sequences. In another example, the reference sequence is limited to a specific human chromosome such as chromosome 13. In some embodiments, a reference Y chromosome is the Y chromosome sequence of the human genome version hg19. Such sequences can be referred to as chromosome reference sequences. Other examples of reference sequences include genomes of other species, as well as chromosomes, subchromosomal regions (such as strands), etc., of any species.
[0092] [0092] In some embodiments, a reference sequence for alignment may have a sequence length of about 1 to about 100 times the length of a reading. In such embodiments, alignment and sequencing are considered targeted alignment or sequencing, rather than alignment or sequencing of the entire genome. In these embodiments, the reference sequence typically includes a gene sequence and/or another restricted sequence of interest. In this sense, aligning a subsequence of a reading to a vNRUMI is a form of targeted alignment.
[0093] [0093] In various embodiments, the reference sequence is a consensus sequence or other combination derived from multiple individuals. However, in certain applications, the reference sequence can be taken from a particular individual.
[0094] [0094] The term "derived", when used in the context of a nucleic acid or a mixture of nucleic acids, here refers to the means by which the nucleic acid(s) is/are obtained from the source from which it/they originate(s). For example, in one embodiment, a mixture of nucleic acids that is derived from two different genomes means that the nucleic acids, for example, cfDNA, have been naturally released by cells through naturally occurring processes such as necrosis or apoptosis. In another embodiment, a mixture of nucleic acids that is derived from two different genomes means that the nucleic acids were extracted from two different types of cells in a subject.
[0095] [0095] The term "biological fluid" here refers to a liquid taken from a biological source and includes, for example, blood, serum, plasma, sputum, lavage fluid, cerebrospinal fluid, urine, semen, sweat, tears, saliva, and the like. As used herein, the terms "blood", "plasma" and "serum" expressly cover fractions or processed portions thereof. Similarly, where a sample is taken from a biopsy, harvest, examination, etc., the “sample” expressly includes a processed fraction or portion derived from the biopsy, harvest, examination, etc.
[0096] [0096] As used here, the term "chromosome" refers to the heredity-bearing gene carrier of a living cell, which is derived from chromatin strands comprising DNA and protein components (especially histones). The conventional internationally recognized human genome chromosome numbering system is employed here.
[0097] [0097] The term "primer", as used here, refers to an isolated oligonucleotide that is capable of acting as a point of initiation of synthesis when placed under conditions conducive to the synthesis of an extension product (for example, conditions that include nucleotides, an inducing agent such as DNA polymerase, necessary ions and molecules, and an appropriate temperature and pH). The primer may preferably be single-stranded for maximum amplification efficiency, but alternatively it may be double-stranded. If double-stranded, the primer is first treated for
[0098] [0098] Next generation sequencing (NGS) technology has developed rapidly, providing new tools for advancement in science and research, as well as in health care and services that depend on genetic information and related biological information. NGS methods are carried out in a massively parallel manner, providing an increasingly high speed for determining biomolecule sequence information. However, many of the NGS methods and associated sample manipulation techniques introduce errors such that the resulting sequences have a relatively high error rate, ranging from one error in a few hundred base pairs to one error in a few thousand base pairs. Such error rates are sometimes acceptable for determining heritable genetic information such as germline mutations, since such information is consistent across most somatic cells, which provide many copies of the same genome in a test sample. An error originating from reading one copy of a sequence has a minimal or removable impact when many copies of the same sequence are read without error. For example, if an erroneous reading of a copy of a sequence cannot be properly aligned with a reference sequence, it can simply be discarded from analysis. Error-free readings of other copies of the same sequence may still provide sufficient information for valid analysis. Alternatively, instead of discarding the reading having a different base pair than other readings of the same sequence, one can discard the pair of
[0099] [0099] However, such error correction approaches do not work well to detect low frequency allele sequences, such as subclonal somatic mutations found in tumor tissue nucleic acids, circulating tumor DNA, low-concentration fetal cfDNA in maternal plasma, drug-resistant mutations of pathogens, etc. In these examples, a DNA fragment can harbor a somatic mutation of interest at a sequence site, while many other fragments at the same sequence site do not have the mutation of interest. In such a scenario, the readings of sequences or base pairs of the mutated DNA fragments may be unused or misinterpreted in conventional sequencing, thus losing information to detect the mutation of interest.
[00100] [00100] Due to these various sources of errors, only increasing the depth of sequencing cannot guarantee the detection of somatic variations with very low allele frequency (for example, <1%). Some implementations described here provide duplex sequencing methods that effectively suppress errors in situations when signals of valid sequences of interest are low, such as samples with low allele frequencies.
[00101] [00101] Unique molecular indices (UMIs) make it possible to use information from multiple readings to suppress sequencing noise. UMIs, along with contextual information such as alignment positions, allow us to trace the origin of each reading to a specific original DNA molecule. Given multiple readings that were produced from the same DNA molecule, computational approaches can be used to separate actual variants (that is, variants biologically present in the original DNA molecules) from variants artificially introduced through sequencing error. Variants may include, but are not limited to, insertions, deletions, multiple nucleotide variants, single nucleotide variants, and structural variants. Using this information, one can infer the true sequence of the DNA molecules. This computational methodology is referred to as reading collapse. This error reduction technology has several important applications. In the context of cell-free DNA analysis, important variants often occur at extremely low frequencies (i.e., <1%), so their signal can be drowned out by sequencing errors. UMI-based noise reduction allows us to call these low frequency variants much more accurately. UMIs and reading collapse can also help to identify PCR duplicates in high coverage data, enabling more accurate variant frequency measurements.
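The first step of tracing readings back to an original molecule can be pictured as grouping readings by their UMI together with their alignment position. The sketch below is illustrative only: the `Reading` record and its field names are hypothetical, and real pipelines typically also tolerate errors in the UMI and small shifts in position.

```python
from collections import defaultdict
from typing import NamedTuple


class Reading(NamedTuple):
    """Hypothetical record for an aligned reading."""
    umi: str         # UMI assigned to the reading
    chrom: str       # chromosome of the alignment
    position: int    # alignment start position
    sequence: str    # bases of the reading


def group_into_families(readings):
    """Group readings that share the same UMI and alignment position.

    Each group is a candidate set of readings produced from one original
    DNA molecule (including its PCR duplicates)."""
    families = defaultdict(list)
    for r in readings:
        families[(r.umi, r.chrom, r.position)].append(r.sequence)
    return dict(families)
```

Readings within one family can then be compared base by base: a variant seen in nearly all members is likely present in the original molecule, whereas a variant seen in a single member is more likely a sequencing or PCR error.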
[00102] [00102] In some implementations, random UMIs are used, in which random sequences are attached to DNA molecules and used as UMI barcodes. However, using a set of intentionally designed non-random UMIs allows for simpler manufacturing in some implementations. Since this approach is non-random, the UMIs are referred to as non-random UMIs (NRUMIs). In some implementations, a set of NRUMIs consists of sequences of uniform length (for example, n = 6 nucleotides long). Due to the A-tailing process by which these NRUMI molecules are ligated to DNA molecules, the (n + 1)th base read, here the 7th, is invariably a thymine (T). This uniformity can cause a degradation in reading quality that propagates through the reading cycles downstream of this base. This effect is illustrated in Figure 2C.
[00103] [00103] Although this issue may be less prominent in non-patterned flow cells sequenced using 4 dyes, its severity is likely to increase in patterned flow cells sequenced using 2 dyes, as base calling becomes inherently more challenging. In some implementations, an innovative process is used to generate sets of NRUMIs of mixed lengths, to uniquely identify such variable length NRUMIs (vNRUMIs), and to correct errors among these vNRUMIs. It offers diversity in the generation and distinction of DNA barcodes of heterogeneous length. Experimental results show that the vNRUMI method is more robust (that is, more capable of correcting sequencing errors) than conventional solutions.
[00104] [00104] In some implementations, a greedy algorithm is used to iteratively build sets of vNRUMIs. In each iteration, it picks a sequence from a pool of vNRUMI candidates such that the chosen sequence maximizes the minimum Levenshtein distance between itself and any vNRUMI that has already been chosen. If multiple sequences share the maximum value of this metric, the algorithm chooses one such sequence at random, preferring sequences of shorter length. This distance metric is required to be at least 3 to enforce good error correction within the resulting vNRUMI set; if this condition cannot be met, the process stops adding new vNRUMIs to the set and returns the set as it is. This entire process can be repeated to generate different sets of vNRUMIs with similar characteristics.
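The greedy construction described above can be sketched in Python as follows. This is a sketch under stated assumptions, not the implementation used in the described methods: the function names, the `max_size` cap, the seeded random generator, and the reading of "preferring sequences of shorter length" as "break ties among the shortest tied candidates" are all choices made here for illustration.

```python
import random
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def candidate_pool(lengths, alphabet="ACGT"):
    """All sequences over `alphabet` with one of the given lengths."""
    return ["".join(p) for n in lengths for p in product(alphabet, repeat=n)]


def build_vnrumi_set(candidates, max_size, min_distance=3, seed=0):
    """Greedily pick sequences that stay far (in Levenshtein distance)
    from everything already chosen."""
    rng = random.Random(seed)
    pool = list(candidates)
    chosen = []
    # nearest[c]: minimum distance from candidate c to the chosen set so far.
    nearest = {c: float("inf") for c in pool}
    while pool and len(chosen) < max_size:
        top = max(nearest[c] for c in pool)
        if top < min_distance:
            break  # no remaining candidate is far enough; return set as is
        tied = [c for c in pool if nearest[c] == top]
        shortest = min(len(c) for c in tied)
        # Assumption: prefer the shortest tied candidates, then pick at random.
        pick = rng.choice([c for c in tied if len(c) == shortest])
        chosen.append(pick)
        pool.remove(pick)
        for c in pool:
            nearest[c] = min(nearest[c], levenshtein(c, pick))
    return chosen
```

Calling `build_vnrumi_set(candidate_pool([6, 7]), max_size=120)` would mirror the lengths and set size mentioned earlier, although this pure-Python version may be slow on that pool; smaller lengths are convenient for experimentation. Changing `seed` yields different sets with the same minimum-distance guarantee.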
[00105] [00105] Adapters can include physical UMIs that allow one to determine which strand of the DNA fragment the readings are derived from. Some embodiments take advantage of this to determine a first consensus sequence for readings derived from one strand of the DNA fragment, and a second consensus sequence for the complementary strand. In many embodiments, a consensus sequence includes nucleotides detected in all or a majority of readings while excluding nucleotides that appear in only a few readings. Different consensus criteria can be implemented. The process of combining readings based on UMIs or alignment sites to obtain a consensus sequence is also referred to as “collapsing” readings. Using physical UMIs, virtual UMIs, and/or alignment locations, it can be determined that the readings for the first and second consensus sequences are derived from the same double-stranded fragment. Therefore, in some embodiments, a third consensus sequence is determined using the first and second consensus sequences obtained for the same DNA fragment/molecule, with the third consensus sequence including nucleotides common to the first and second consensus sequences while excluding nucleotides that are inconsistent between the two. In alternative implementations, only one consensus sequence is directly obtained by collapsing all readings derived from both strands of the same fragment, instead of comparing the two consensus sequences obtained from the two strands. Finally, the fragment sequence can be determined from the third consensus sequence or the single consensus sequence, which includes base pairs that are consistent in readings derived from both strands of the fragment.
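The per-strand and duplex consensus steps can be illustrated with a simple majority vote. The sketch below makes several assumptions that are not stated in the text: readings in a family are already aligned and of equal length, readings from the complementary strand have already been converted to the same orientation, a strict majority (more than half) is the consensus criterion, and positions without consensus are reported as "N".

```python
from collections import Counter


def consensus(readings, min_fraction=0.5):
    """Per-position majority call over equal-length, pre-aligned readings.

    A base is kept only if it appears in more than `min_fraction` of the
    readings at that position; otherwise 'N' is emitted."""
    if not readings:
        raise ValueError("at least one reading is required")
    length = len(readings[0])
    if any(len(r) != length for r in readings):
        raise ValueError("readings must have equal length")
    out = []
    for column in zip(*readings):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(readings) > min_fraction else "N")
    return "".join(out)


def duplex_consensus(strand1_readings, strand2_readings):
    """Third consensus: keep bases on which the two strand consensuses agree."""
    first = consensus(strand1_readings)
    second = consensus(strand2_readings)
    return "".join(a if a == b else "N" for a, b in zip(first, second))
```

For example, if one strand yields the consensus "ACGT" and the other yields "ACGA", the duplex consensus keeps the first three bases and masks the last, since an error confined to one strand is not supported by its complement.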
[00106] [00106] In some embodiments, the method combines different types of indices to determine the source polynucleotide from which readings are derived. For example, the method can use both physical and virtual UMIs to identify readings derived from a single DNA molecule. By using a second form of UMI in addition to the physical UMI, the physical UMIs can be shorter than when only physical UMIs are used to determine the source polynucleotide. This approach has minimal impact on library preparation performance, and does not require extra reading length for sequencing.
[00107] [00107] Applications of the methods described include:
[00108] [00108] Figure 1A is a flow chart illustrating an exemplary workflow 100 for using UMIs to sequence nucleic acid fragments. Workflow 100 is illustrative of just a few implementations. It is understood that some implementations employ workflows with additional operations not shown here, while other implementations may skip some of the operations illustrated here. For example, some implementations do not require operation 102 and/or operation 104. Also, workflow 100 is employed for whole genome sequencing. In some implementations involving targeted sequencing, operational steps to hybridize and enrich certain regions can be applied between operations 110 and 112.
[00109] [00109] Operation 102 provides double-stranded DNA fragments. DNA fragments can be obtained by fragmenting genomic DNA, collecting naturally fragmented DNA (for example, cfDNA or ctDNA),
[00110] [00110] In some implementations, fragmented or damaged DNA can be processed without requiring additional fragmentation. For example, formalin-fixed, paraffin-embedded (FFPE) DNA or certain cfDNA are sometimes fragmented enough that no further fragmentation steps are required.
[00111] [00111] Figure 1B shows a DNA fragment/molecule and the adapters used in the initial stages of workflow 100 in Figure 1A. Although only one double-stranded fragment is illustrated in Figure 1B, thousands to millions of fragments of a sample can be prepared simultaneously in the workflow. DNA fragmentation by physical methods produces heterogeneous ends, comprising a mixture of 3' overhangs, 5' overhangs, and blunt ends. The overhangs will be of varying lengths and the ends may or may not be phosphorylated. An example of the double-stranded DNA fragments obtained from fragmentation of genomic DNA in operation 102 is shown as fragment 123 in Figure 1B.
[00112] [00112] Fragment 123 has both a 3' overhang on the left end and a 5' overhang on the right end, and is marked with p and q, which indicate two sequences in the fragment that can be used as virtual UMIs in some implementations. When used alone or combined with physical UMIs from an adapter to be ligated to the fragment, they can uniquely identify the fragment. UMIs are uniquely associated with a single DNA fragment in a sample including a source polynucleotide and its complementary strand. A physical UMI is a sequence of an oligonucleotide attached to the source polynucleotide, its complementary strand, or a polynucleotide derived from the source polynucleotide. A virtual UMI is a sequence of an oligonucleotide within the source polynucleotide, its complementary strand, or a polynucleotide derived from the source polynucleotide. Under this scheme, one can also refer to the physical UMI as an extrinsic or exogenous UMI, and the virtual UMI as an intrinsic or endogenous UMI.
[00113] [00113] The two p and q sequences actually refer to two complementary sequences at the same genomic site, but for the sake of simplicity, they are indicated on just one strand in some of the double strand fragments shown here. Virtual UMIs such as p and q can be used in a later step of the workflow to help identify readings originating from one or both strands of the single source DNA fragment. With the readings thus identified, they can be collapsed to obtain a consensus sequence.
[00114] [00114] If DNA fragments are produced by physical methods, workflow 100 proceeds to perform end repair operation 104, which produces blunt-ended fragments having 5'-phosphorylated ends. In some implementations, this step converts the overhangs resulting from fragmentation to blunt ends using T4 DNA polymerase and Klenow enzyme. The 3' to 5' exonuclease activity of these enzymes removes 3' overhangs and the 5' to 3' polymerase activity fills in the 5' overhangs. In addition, T4 polynucleotide kinase in this reaction phosphorylates the 5' ends of the DNA fragments. Fragment 125 in Figure 1B is an example of a blunt-ended, end-repaired product.
[00115] [00115] After end repair, workflow 100 proceeds to operation 106 to adenylate the 3' ends of the fragments, which is also referred to as A-tailing or dA-tailing, in which a single dATP is added to the 3' ends of the blunt fragments to prevent them from ligating to each other during the adapter ligation reaction. Double-stranded molecule 127 of Figure 1B shows an A-tailed fragment having blunt ends with 3'-dA overhangs and 5'-phosphate ends. A single 'T' nucleotide at the 3' end of each of the two sequencing adapters, as seen in item 129 of Figure 1B, provides an overhang complementary to the 3'-dA overhang at each end of the insert for ligating the two adapters to the insert.
[00116] [00116] After adenylation of the 3' ends, workflow 100 proceeds to operation 108 to ligate partially double-stranded adapters to both ends of the fragments. In some implementations, the adapters used in the reaction include different physical UMIs to associate sequence readings with a single source polynucleotide, which can be a single- or double-stranded DNA fragment. In some implementations, the set of physical UMIs used in the reaction are random UMIs. In some implementations, the set of physical UMIs used in the reaction are non-random UMIs (NRUMIs). In some implementations, the set of physical UMIs used in the reaction are variable length non-random UMIs (vNRUMIs).
[00117] [00117] Item 129 of Figure 1B illustrates two adapters to be ligated to the double-stranded fragment that includes two virtual UMIs p and q near the ends of the fragment. These adapters are illustrated based on the sequencing adapters of the Illumina platform, as several implementations can use Illumina's NGS platform to obtain readings and detect sequences of interest. The adapter shown on the left includes the physical UMI α in its double-stranded region, while the adapter on the right includes the physical UMI β in its double-stranded region. In the strand having the 5' denatured end, in the 5' to 3' direction, the adapters have a P5 sequence, an index sequence, a read 2 primer sequence, and a physical UMI (α or β). In the strand having the 3' denatured end, in the 3' to 5' direction, the adapters have a P7' sequence, an index sequence, a read 1 primer sequence, and the physical UMI (α or β).
[00118] [00118] The oligonucleotides P5 and P7' are complementary to the amplification primers bound to the surface of flow cells of the Illumina sequencing platform. In some implementations, the index sequence provides a means to trace the source of a sample, thus allowing multiplexing of multiple samples on the sequencing platform. Other adapter designs and sequencing platforms can be used in various implementations. Adapters and sequencing technology are further described in the following sections.
[00119] [00119] The reaction depicted in Figure 1B adds distinct sequences to the genomic fragment. A ligation product 120 of the same fragment described above is illustrated in Figure 1B. This ligation product 120 has the physical UMI α, the virtual UMI p, the virtual UMI q, and the physical UMI β on its top strand, in the 5'-3' direction. The ligation product also has the physical UMI β, the virtual UMI q, the virtual UMI p, and the physical UMI α on its bottom strand, in the 5'-3' direction. This disclosure also encompasses methods using sequencing technologies and adapters other than those provided by Illumina.
[00120] [00120] Although the example adapters here have the physical UMIs in the double-stranded regions of the adapters, some implementations use adapters having physical UMIs in the single-stranded regions, such as adapters (i) and (iv) in Figure 2A.
[00121] [00121] In some implementations, the products of this ligation reaction are purified and/or size-selected by agarose gel electrophoresis or magnetic beads. Size-selected DNA is then amplified by PCR to enrich for fragments that have adapters at both ends. See block 110. As mentioned earlier, in some implementations, operations to hybridize and enrich certain regions of the DNA fragments can be applied to target the regions for sequencing.
[00122] [00122] Workflow 100 then proceeds to cluster amplify the PCR products, for example, on an Illumina platform. See operation 112. By clustering the PCR products, libraries can be pooled for multiplexing, for example, with up to 12 samples per lane, using different index sequences on the adapters to track different samples.
[00123] [00123] After cluster amplification, sequencing readings can be obtained through sequencing by synthesis on the Illumina platform. See operation 114. Although the adapters and sequencing process described here are based on the Illumina platform, other sequencing technologies, especially NGS methods, can be used instead of or in addition to the Illumina platform.
[00124] [00124] Workflow 100 can collapse readings having the same physical UMI(s) and/or the same virtual UMI(s) into one or more groups, thus obtaining one or more consensus sequences. See the operation
[00125] [00125] Finally, workflow 100 uses the one or more consensus sequences to determine the sequence of the nucleic acid fragment from the sample. See operation 118. This may involve determining the sequence of the nucleic acid fragment as the third consensus sequence or the single consensus sequence described above.
[00126] [00126] In a particular implementation that includes operations similar to operations 108-119, a method for sequencing nucleic acid molecules in a sample using non-random UMIs involves the following: (a) applying adapters to DNA fragments in the sample to obtain DNA-adapter products, where each adapter comprises an NRUMI, and where the NRUMIs of the adapters have at least two different molecular lengths, forming a set of vNRUMIs; (b) amplifying the DNA-adapter products to obtain a plurality of amplified polynucleotides; (c) sequencing the plurality of amplified polynucleotides, thus obtaining a plurality of readings associated with the set of vNRUMIs; (d) identifying, among the plurality of readings, readings associated with the same vNRUMI; and (e) determining a sequence of a DNA fragment in the sample using the readings associated with the same vNRUMI.
[00127] [00127] In another implementation, random UMIs of varying length are used for the sequencing of nucleic acid molecules. The method includes: (a) applying adapters to DNA fragments in the sample to obtain DNA adapter products, where each adapter comprises a unique molecular index (UMI), and where the unique molecular indexes (UMIs) of the adapters have at least two different molecular lengths and form a set of variable-length unique molecular indices (vUMIs); (b) amplifying the DNA adapter products to obtain a plurality of amplified polynucleotides; (c) sequencing the plurality of amplified polynucleotides, thus obtaining a plurality of readings associated with the set of vUMIs; and (d) identifying, among the plurality of readings, readings associated with the same variable-length unique molecular index (vUMI). Some implementations additionally include determining a sequence of a DNA fragment in the sample using the readings associated with the same vUMI.
[00128] [00128] In some implementations, the UMIs used for sequencing nucleic acid fragments can be random fixed-length UMIs, non-random fixed-length UMIs, random variable-length UMIs, non-random variable-length UMIs, or any combination thereof. In these implementations, the method for sequencing nucleic acid fragments includes: (a) applying adapters to DNA fragments in the sample to obtain DNA adapter products, where each adapter comprises a unique molecular index (UMI) in a set of unique molecular indices (UMIs); (b) amplifying the DNA adapter products to obtain a plurality of amplified polynucleotides; (c) sequencing the plurality of amplified polynucleotides, thus obtaining a plurality of readings associated with the set of UMIs; (d) obtaining, for each reading of the plurality of readings, alignment scores in relation to the set of UMIs, each alignment score indicating similarity between a subsequence of a reading and a UMI; (e) identifying, among the plurality of readings, readings associated with the same UMI using the alignment scores; and (f) determining a sequence of a DNA fragment in the sample using the readings associated with the same UMI. In some implementations, alignment scores are based on nucleotide matches and nucleotide edits between the subsequence of the reading and the UMI. In some implementations, each alignment score penalizes mismatches at the beginning of a sequence, but does not penalize mismatches at the end of the sequence.
[00129] [00129] In some implementations, the sequence readings are paired-end readings. Each reading includes a non-random UMI or is associated with a non-random UMI through a paired-end reading. In some implementations, the reading lengths are shorter than the DNA fragments or shorter than half the length of the fragments. In such cases, the complete sequence of the entire fragment is sometimes not determined. Instead, the two ends of the fragment are determined. For example, a DNA fragment can be 500 bp long, from which two 100 bp paired-end readings can be derived. In this example, the 100 bases at each end of the fragment can be determined, and the 300 bp in the middle of the fragment may not be determined without using information from other readings. In some implementations, if the two paired-end readings are long enough to overlap, the complete sequence of the entire fragment can be determined from the two readings. For example, see the example described in connection with figure 5.
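The arithmetic in the 500 bp example can be sketched in Python as follows. This is only an illustration: the function name paired_end_coverage and its return keys are hypothetical and do not come from the described method, and both readings are assumed to have the same length.

```python
def paired_end_coverage(fragment_len: int, read_len: int) -> dict:
    """Summarize how much of a fragment two equal-length paired-end readings cover."""
    # A reading cannot be longer than the fragment it is derived from.
    read_len = min(read_len, fragment_len)
    covered = min(fragment_len, 2 * read_len)
    return {
        "covered": covered,                                        # bases seen by at least one reading
        "undetermined_middle": fragment_len - covered,             # bases seen by neither reading
        "overlap": max(0, 2 * read_len - fragment_len),            # bases seen by both readings
    }


print(paired_end_coverage(500, 100))
# {'covered': 200, 'undetermined_middle': 300, 'overlap': 0}
print(paired_end_coverage(150, 100))
# {'covered': 150, 'undetermined_middle': 0, 'overlap': 50}
```

The second call corresponds to the case where the two readings overlap, so the whole fragment can be reconstructed from the pair.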
[00130] [00130] In some implementations, an adapter has a duplex non-random UMI in the double-stranded region of the adapter, and each reading includes a first non-random UMI at one end and a second non-random UMI at the other end.
Method for sequencing nucleic acid fragments using vNRUMIs
[00131] [00131] In some implementations, vNRUMIs are incorporated in adapters for the sequencing of DNA fragments. vNRUMIs provide a mechanism to suppress different types of errors that occur in a workflow such as the one described earlier. Some of the errors, such as deletions, additions, and substitutions, can occur in the sample processing phase. Other errors can occur in the sequencing phase. Some errors can be located in bases derived from DNA fragments; other errors can be located in bases corresponding to the UMIs in the adapters.
[00132] [00132] Some implementations provide a new process to detect and correct errors in vNRUMIs and in sequence readings. At a high level, given a reading containing a (potentially misread) vNRUMI and its downstream bases, the process uses a hybrid global-local (glocal) alignment strategy to match the first few bases of the reading against a known vNRUMI, thus obtaining alignment scores between prefix sequences of the reading and the known vNRUMI. The vNRUMI having the highest glocal alignment score is determined to be the vNRUMI associated with the reading, which provides a mechanism to collapse the reading with other readings associated with the same vNRUMI, thus correcting errors. Pseudocode for obtaining glocal alignment scores and matching vNRUMIs using glocal alignment scores in some implementations is provided below.
glocal algorithm:
    input: DNA sequences x and y
           integer scores for (match, mismatch, gap), default (1, -1, -1)
    output: z, an integer value that increases with sequence similarity
    scores = numeric matrix with length(x)+1 rows and length(y)+1 columns
    for i from 0 to length(x), inclusive:
        scores[i][0] = i * gap
    for j from 0 to length(y), inclusive:
        scores[0][j] = j * gap
    for i from 1 to length(x), inclusive:
        for j from 1 to length(y), inclusive:
            cost = match if x[i-1] == y[j-1], otherwise cost = mismatch
            set scores[i][j] to the maximum of:
                scores[i-1][j-1] + cost
                scores[i-1][j] + gap
                scores[i][j-1] + gap
    z = maximum value in the last row and last column of the scores matrix
    return z
vNRUMI match algorithm:
    input: set X containing all valid/unmutated vNRUMIs
           sequence Q, a possibly mutated vNRUMI and its downstream bases
    output: m1, the set of most likely vNRUMI matches
            m2, the set of second most likely vNRUMI matches
    potentialLengths = unique lengths of all sequences in X
    matchScores = list containing potential matches for Q and their corresponding scores
    n = maximum length of any sequence in X
    subseq = first n bases of Q
    for every sequence S in X:
        record glocal(S, subseq) in matchScores, along with the sequence S
    m1 = sequences in X with the highest observed glocal score
    m2 = sequences in X with the second highest observed glocal score
    return m1 and m2
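The pseudocode above can be turned into a runnable Python sketch. It is an illustration rather than the implementation used in practice: the function names glocal_score and match_vnrumi are hypothetical, and the border cells are assumed to be initialized with cumulative gap penalties (i * gap), which is consistent with penalizing edits at the beginning of the alignment.

```python
def glocal_score(x: str, y: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Glocal alignment score: edits at the start are penalized, trailing ones are not."""
    rows, cols = len(x) + 1, len(y) + 1
    scores = [[0] * cols for _ in range(rows)]
    # Assumed border initialization: leading gaps accumulate the gap penalty.
    for i in range(rows):
        scores[i][0] = i * gap
    for j in range(cols):
        scores[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            cost = match if x[i - 1] == y[j - 1] else mismatch
            scores[i][j] = max(
                scores[i - 1][j - 1] + cost,  # match or substitution
                scores[i - 1][j] + gap,       # gap in y
                scores[i][j - 1] + gap,       # gap in x
            )
    # Best score over the last row and last column: one sequence is consumed in
    # full, while the other may stop at any prefix without penalty.
    last_row = scores[-1]
    last_col = [row[-1] for row in scores]
    return max(last_row + last_col)


def match_vnrumi(valid_vnrumis, read: str):
    """Return (best, second_best) sets of vNRUMIs for the start of a reading."""
    n = max(len(s) for s in valid_vnrumis)
    subseq = read[:n]
    scored = {s: glocal_score(s, subseq) for s in valid_vnrumis}
    ranked = sorted(set(scored.values()), reverse=True)
    best = {s for s, v in scored.items() if v == ranked[0]}
    second = {s for s, v in scored.items() if v == ranked[1]} if len(ranked) > 1 else set()
    return best, second


print(glocal_score("AACTTC", "GTCTTCG"))   # 2
print(glocal_score("CGCTTCG", "GTCTTCG"))  # 4
```

With the two vNRUMIs AACTTC and CGCTTCG and a reading starting with GTCTTCG, the sketch reproduces the scores 2 and 4 discussed in connection with figure 1E, so the reading would be assigned to CGCTTCG.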
[00133] [00133] It is interesting to note the use of a distance metric
[00134] [00134] In some implementations, neither a traditional Needleman-Wunsch global alignment method nor a traditional Smith-Waterman local alignment method is used; instead, a new hybrid approach is used. Namely, the alignment uses a Needleman-Wunsch approach at the beginning of the alignment, penalizing edits there, but takes advantage of Smith-Waterman local alignment concepts at the end of the alignment by not penalizing end edits. In this sense, the current alignment approach encompasses both a global and a local component, and is therefore referred to as a glocal alignment approach. In the case of an insertion or deletion error in sequencing, the alignment would shift considerably. The glocal approach would not penalize that single event any more than it would penalize a single point mutation. Allowing trailing gaps makes this possible.
[00135] [00135] The glocal alignment approach can operate with a barcode pool of heterogeneous lengths, a feature that distinguishes it from conventional methodologies.
[00136] [00136] In the identification of matches, some implementations may return multiple vNRUMI matches as the “best” when there are ties. Although the previous pseudocode only reflects the best and the second best sets returned, some implementations have the ability to return more than just two sets of vNRUMIs, such as a third best set, a fourth best set, etc. By providing more information on good matches, the process can better correct errors by collapsing the readings associated with one or more candidate vNRUMI matches. Figure 1C is a block diagram showing a process for sequencing DNA fragments using vNRUMIs to suppress errors occurring in the DNA fragments and errors in the UMIs that are used to label the source molecules of the DNA fragments. Process 130 begins by applying adapters to DNA fragments in a sample to obtain DNA adapter products. See block 131. Each of the adapters has a non-random unique molecular index. The non-random unique molecular indices of the adapters have at least two different molecular lengths and form a set of variable-length non-random unique molecular indices (vNRUMIs).
[00137] [00137] In some implementations, an adapter is attached, ligated, inserted, incorporated, or otherwise linked to each end of the DNA fragments. In some implementations, the sample containing the DNA fragments is a blood sample. In some implementations, the DNA fragments contain cell-free DNA fragments. In some implementations, DNA fragments include cell-free DNA originating from a tumor, and the sequence of DNA fragments in the sample is indicative of the tumor.
[00138] [00138] Process 130 proceeds to amplify the DNA adapter products to obtain a plurality of amplified polynucleotides. See block 132. Process 130 additionally involves sequencing the plurality of amplified polynucleotides, thus obtaining a plurality of readings associated with the set of vNRUMIs. See block 133. In addition, process 130 involves identifying readings associated with the same vNRUMI among the plurality of readings. See block 134.
[00139] [00139] As mentioned earlier, process 130 illustrated in figure 1C provides a method for sequencing DNA fragments using vNRUMIs. Process 130 begins by applying adapters to DNA fragments in the sample to obtain DNA adapter products (block 131). Process 130 also involves amplifying the DNA adapter products to obtain a plurality of amplified polynucleotides (block 132); sequencing the plurality of amplified polynucleotides, thus obtaining a plurality of readings associated with the set of vNRUMIs (block 133); identifying readings associated with the same vNRUMI (block 134); and determining a sequence of DNA fragments in the sample using the readings associated with the same vNRUMI (block 135). The sample can be a blood sample, a plasma sample, a tissue sample, or one of the samples described elsewhere herein. In some implementations, the adapters applied in step 131 can be obtained from a process such as process 140 illustrated in figure 1D.
[00140] [00140] In some implementations, the adapter vNRUMIs have at least two different molecular lengths. In some implementations, the set of vNRUMIs has two different molecular lengths. In some implementations, vNRUMIs have six or seven nucleotides. In some implementations, vNRUMIs have more than two different molecular lengths, such as having three, four, five, six, seven, eight, nine, ten, twenty, or more different molecular lengths. In some implementations, molecular lengths are chosen from the range 4-100. In some implementations, molecular lengths are chosen from the range 4-20. In some implementations, molecular lengths are chosen from range 5-15.
[00141] [00141] In some implementations, the set of vNRUMIs includes no more than about 10,000 different vNRUMIs. In some implementations, the set of vNRUMIs includes no more than about 1000 different vNRUMIs. In some implementations, the set of vNRUMIs includes no more than about 200 different vNRUMIs.
[00142] [00142] In some implementations, step 134 of identifying readings associated with the same vNRUMI involves obtaining, for each reading of the plurality of readings, alignment scores in relation to the vNRUMIs. Each alignment score indicates similarity between a subsequence of a reading and a vNRUMI. The subsequence is in a region of the reading in which nucleotides derived from the vNRUMI are likely to be located. In other words, in some implementations, the subsequence includes the first nucleotides in a region where the vNRUMI is expected to be located. In some implementations, the subsequence size is equal to the size of the longest vNRUMI in the set of vNRUMIs.
[00143] [00143] In some implementations, the alignment scores are based on matches and mismatches/edits of nucleotides between the subsequence of the reading and the vNRUMI. In some implementations, nucleotide edits include nucleotide substitutions, additions, and deletions. In some implementations, the alignment score penalizes edits at the beginning of a sequence (for example, a subsequence of a reading or a reference sequence of a vNRUMI), but does not penalize edits at the end of the sequence. The alignment score reflects the similarity between the reading subsequence and the vNRUMI reference sequence.
[00144] [00144] In some implementations, obtaining an alignment score between a reading and a vNRUMI involves: (a) calculating an alignment score between the vNRUMI and each of all possible prefix sequences of the reading sequence; (b) calculating an alignment score between the
[00145] [00145] In some implementations, the reading subsequence has a length that is equal to the length of the longest vNRUMI in the set of vNRUMIs.
[00146] [00146] In some implementations, identifying the readings associated with the same vNRUMI includes selecting, for each reading of the plurality of readings, at least one vNRUMI from the set of vNRUMIs based on the alignment scores; and associating each reading of the plurality of readings with the at least one vNRUMI selected for the reading. In some implementations, selecting at least one vNRUMI from the set of vNRUMIs includes selecting the vNRUMI having the highest alignment score among the set of vNRUMIs.
[00147] [00147] In some implementations, one vNRUMI is identified by the highest alignment score. In some implementations, two or more vNRUMIs are identified by the highest alignment score. In such a case, contextual information about the readings can be used to select one of the two or more vNRUMIs that should be associated with the readings to determine the sequence in the DNA fragments. For example, the total number of readings identified for one vNRUMI can be compared to the total number of readings identified for another vNRUMI, and the higher total number determines the vNRUMI that should be used to indicate the source of the DNA fragment. In another example, reading sequence information or reading locations in a reference sequence can be used to select one of the identified vNRUMIs associated with the readings, the selected vNRUMI being used to determine the source of the sequence readings.
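One possible way to apply the read-count heuristic to tied matches is sketched below in Python. The function name resolve_ties is hypothetical, and several details are assumptions made only for illustration: support is counted from readings with a single top-scoring vNRUMI, and any remaining ties are broken alphabetically so that the result is deterministic.

```python
from collections import Counter


def resolve_ties(candidates_per_read):
    """Pick one vNRUMI per reading, preferring the candidate with more supporting readings.

    candidates_per_read: list of sets, each holding the top-scoring vNRUMIs of one reading.
    """
    # Assumption: only unambiguous readings contribute to the support counts.
    support = Counter()
    for candidates in candidates_per_read:
        if len(candidates) == 1:
            support[next(iter(candidates))] += 1
    assigned = []
    for candidates in candidates_per_read:
        # sorted() makes the choice deterministic when support counts are equal.
        assigned.append(max(sorted(candidates), key=lambda s: support[s]))
    return assigned


reads = [{"AACTTC"}, {"AACTTC"}, {"CGCTTCG"}, {"AACTTC", "CGCTTCG"}]
print(resolve_ties(reads))
# ['AACTTC', 'AACTTC', 'CGCTTCG', 'AACTTC']
```

Other contextual information mentioned in the text, such as reading positions in a reference sequence, could replace or supplement the support counts.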
[00148] [00148] In some implementations, two or more of the highest alignment scores can be used to identify two or more vNRUMIs to indicate the potential source of any fragment. Contextual information can be used as mentioned earlier to determine which of the vNRUMIs indicates the actual source of the DNA fragment.
[00149] [00149] Figure 1E shows examples of how a subsequence of a reading, or query sequence (Q), can be compared to two reference sequences in the set of vNRUMIs y = {S1, S2} = {AACTTC, CGCTTCG}. Query sequence Q includes the first seven nucleotides of the reading sequence, where nucleotides are expected to be derived from vNRUMIs.
[00150] [00150] The query sequence Q consists of the seven nucleotides GTCTTCG. Q has the same length as the longest vNRUMI in the set of vNRUMIs y. The alignment score table 150 shows the alignment scores for the prefix sequences of Q and S1. For example, cell 151 shows the alignment score for the prefix sequence of Q (GTCTTC) and the complete sequence of S1 (AACTTC). The alignment score takes into account the number of matches between the two sequences, as well as the number of edits between the two sequences. For each matched nucleotide, the score goes up by 1; for each deletion, addition, or substitution, the score goes down by 1. In contrast, the Levenshtein distance is an edit distance, which does not consider the number of matches between two sequences, but only considers the number of additions, deletions, and substitutions.
[00151] [00151] Comparing the prefix sequence of Q (GTCTTC) and S1 (AACTTC) nucleotide by nucleotide, there is a mismatch between G and A, a mismatch between T and A, a match between C and C, a match between T and T, a match between T and T, and a match between C and C. With four matches and two mismatches, the alignment score for the two sequences is 4 - 2 = 2, as shown in cell 151. The alignment score does not penalize the trailing nucleotide G at the end of sequence Q.
[00152] [00152] In the alignment score table 150, the rightmost column, with the alignment scores in bold, shows the alignment scores between the complete query sequence Q and all possible prefix sequences of the reference vNRUMI sequence S1. The bottom row of the alignment score table 150 shows the alignment scores between the complete sequence S1 and all possible prefix sequences of Q. In various implementations, the highest alignment score in the rightmost column and in the bottom row is selected as the glocal alignment score between Q and S1. In this example, cell 151 has the highest value, which is determined as the glocal alignment score between Q and S1, or g(Q, S1).
[00153] [00153] The highest alignment score in the bottom row and in the rightmost column is used as the glocal alignment score between two sequences. The different string operations are weighted equally in the alignment scores illustrated here. An alignment score is calculated as: number of matches - number of insertions - number of deletions - number of substitutions = number of matches - Levenshtein distance. However, as mentioned earlier, in some implementations, different string operations can be weighted differently when calculating an alignment score. For example, in some implementations (not shown in figure 1E), an alignment score can be calculated as: number of matches × 5 - number of insertions × 4 - number of deletions × 4 - number of substitutions × 6, or using other weight values.
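The linear scoring rule can be written as a small Python helper. The function name alignment_score and the weights parameter are hypothetical and serve only to show how the equal-weight and the 5/4/4/6 examples relate.

```python
def alignment_score(matches: int, insertions: int, deletions: int, substitutions: int,
                    weights=(1, 1, 1, 1)) -> int:
    """Linear alignment score: weighted matches minus weighted edits.

    weights = (match, insertion, deletion, substitution); the default weighs
    every operation equally, as in the scores illustrated in figure 1E.
    """
    w_match, w_ins, w_del, w_sub = weights
    return (matches * w_match
            - insertions * w_ins
            - deletions * w_del
            - substitutions * w_sub)


# Q prefix GTCTTC vs S1 AACTTC: four matches and two substitutions.
print(alignment_score(4, 0, 0, 2))                        # 2
print(alignment_score(4, 0, 0, 2, weights=(5, 4, 4, 6)))  # 8
```

Under the example weights, substitutions cost more than insertions or deletions, so the same alignment is ranked differently relative to alignments that contain gaps.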
[00154] [00154] In the implementations described above, the alignment scores combine the effects of matches and edits in a linear manner, namely by addition and/or subtraction. In other implementations, alignment scores can combine the effects of matches and edits in a non-linear manner, such as by multiplication or logarithmic operations.
[00155] [00155] The alignment scores in the rightmost column and in the bottom row indicate similarity between prefix sequences on the one hand and a complete sequence on the other. When the beginning of a prefix sequence does not match the beginning of the complete sequence, the alignment score is penalized. In this sense, the alignment score has a global component. On the other hand, when the end of a prefix sequence does not match the end of the complete sequence, the alignment score is not penalized. In this sense, the alignment score has a local component. Therefore, the alignment scores in the rightmost column and the bottom row can be described as “glocal” alignment scores. The glocal alignment score between Q and S1 is the highest alignment score in the bottom row and in the rightmost column, which is 2, in cell 151, for the prefix sequence of Q (GTCTTC) and S1 (AACTTC).
[00156] [00156] The Levenshtein distance between the prefix sequence of Q (GTCTTC) and S1 (AACTTC) is also 2, since there is a mismatch between G and A, a mismatch between T and A, and four matches for CTTC. For these two sequences, the Levenshtein distance and the alignment score are the same.
[00157] [00157] In contrast to a glocal alignment score, a pure global alignment score compares the complete Q sequence with the complete S1 sequence; it is the alignment score in the lower right corner of table 150.
[00158] [00158] Table 152 in figure 1E shows the alignment scores for the query sequence Q and reference sequence S2 (CGCTTCG). The highest alignment score in the rightmost column and in the bottom row is in cell 153, having a value of 4. It is the glocal alignment score between Q and S2, or g(Q, S2). The Levenshtein distance between Q and S2 is identical to the Levenshtein distance between Q and S1, since there are two mismatches between the two sequences in both comparisons. However, g(Q, S2) is greater than g(Q, S1), since there are more matched nucleotides between Q and S2 than between Q and S1. Namely, glocal alignment scores consider not only nucleotide edits (as the Levenshtein distance does), but also nucleotide matches between sequences.
[00159] [00159] Figure 1E illustrates that the glocal alignment score can provide better error correction than the Levenshtein distance or edit distance, since the Levenshtein distance considers only the number of edits in the sequence, while the glocal alignment score considers both the number of edits and the number of matches between the sequences. Figure 1F provides an example illustrating that the glocal alignment score can provide better error suppression than the global alignment score, since the glocal alignment score does not over-penalize mismatches due to insertion, deletion, or substitution at the end of the sequence.
[00160] [00160] The example in figure 1F uses a different set of vNRUMI sequences, y = {S1, S2} = {TTGTGAC, GGCCAT}. During sample processing, S1 is used to label a DNA molecule. The sequence of this molecule is m0 = TTGTGACTNNNNN (SEQ ID NO: 1). During sequencing, a single insertion error occurs and the sequence GCA is inserted into m0, creating m1 = TTGGCATGACTNNNNN (SEQ ID NO: 2). To correct this error and recover the proper UMI for that sequence, a process takes the first 7 bases as the query sequence, Q = TTGGCAT. The process compares Q with each sequence in y.
[00161] [00161] An alignment score table 160 for g (Q, S1) is obtained and shown in figure 1F. Similarly, an alignment score table 163 is obtained for g (Q, S2).
[00162] [00162] If a global alignment scheme were used instead of a glocal alignment scheme, the scores in the lower right corner, in cells 161 and 164, would be used, which have a value of 2 in both cases. An optimal alignment of Q (TTGGCAT) and S1 (TTGTGAC) aligns TTG-GCAT with TTGTG-AC, where dashes represent insertions or gaps. This alignment involves 5 matches, 2 insertions, and 1 substitution, providing an alignment score of 5 - 2 - 1 = 2. An optimal alignment of Q (TTGGCAT) and S2 (GGCCAT) aligns TTGGC-AT with --GGCCAT. This alignment involves 5 matches and 3 insertions, providing an alignment score of 5 - 3 = 2. Using a global alignment score, it cannot be conclusively determined which of S1 and S2 is more likely to be the actual vNRUMI.
[00163] [00163] However, using a glocal alignment scheme, which uses the maximum value in the last row and column, the process obtains an alignment score of 3 for the Q prefix sequence TTGGC and S1 (TTGTGAC), which makes the glocal score for S1 (3) higher than the glocal score for S2 (2). As such, the process can correctly associate Q with S1.
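The contrast between global and glocal scoring in the figure 1F example can be checked with a short Python sketch. The function names score_matrix and global_and_glocal are hypothetical, the default scores (1, -1, -1) are taken from the pseudocode, and the border initialization with cumulative gap penalties is an assumption.

```python
def score_matrix(x: str, y: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Fill the dynamic-programming matrix shared by global and glocal scoring."""
    m = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        m[i][0] = i * gap  # assumed: leading gaps are penalized
    for j in range(len(y) + 1):
        m[0][j] = j * gap
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            cost = match if x[i - 1] == y[j - 1] else mismatch
            m[i][j] = max(m[i - 1][j - 1] + cost, m[i - 1][j] + gap, m[i][j - 1] + gap)
    return m


def global_and_glocal(x: str, y: str):
    """Return (global score, glocal score) read from the same matrix."""
    m = score_matrix(x, y)
    global_score = m[-1][-1]                             # lower right corner only
    glocal_score = max(m[-1] + [row[-1] for row in m])   # best of last row and last column
    return global_score, glocal_score


Q = "TTGGCAT"
print(global_and_glocal("TTGTGAC", Q))  # (2, 3) for S1
print(global_and_glocal("GGCCAT", Q))   # (2, 2) for S2
```

The global scores tie at 2, while the glocal score separates S1 (3) from S2 (2), matching the values given for cells 161 and 164 and for the glocal comparison.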
[00164] [00164] Going back to figure 1C, step 135 involves determining a sequence of a DNA fragment in the sample using the readings associated with the same vNRUMI. In some implementations, determining the sequence of the DNA fragment involves collapsing the readings associated with the same vNRUMI to obtain a consensus sequence, which can be achieved as further described hereinafter. In some implementations, the consensus sequence is based on reading quality scores, as well as the reading sequence. In addition or alternatively, other contextual information such as the position of the readings can be used to determine the consensus sequence.
[00165] [00165] In some implementations, determining the sequence of the
[00166] [00166] In some implementations, determining the sequence of the DNA fragment involves identifying, among the readings associated with the same vNRUMI, readings sharing a common virtual UMI or similar virtual UMIs, where the common virtual UMIs are found in the DNA fragment. The method also involves determining the DNA fragment sequence using only readings that are both associated with the same vNRUMI and share the same virtual UMIs or similar virtual UMIs.
[00167] [00167] In some implementations, sequencing adapters having vNRUMIs can be prepared by a process shown in figure 1D and further described hereinafter.
Design of physical UMIs
[00168] [00168] In some implementations of the adapters described earlier, the physical UMIs on the adapters include random UMIs. In some implementations, each random UMI is different from all other random UMIs applied to the DNA fragments. In other words, the random UMIs are randomly selected without replacement from a set of UMIs including all possible different UMIs given the sequence length(s). In other implementations, the random UMIs are randomly selected with replacement. In these implementations, two adapters can have the same UMI due to chance.
[00169] [00169] In some implementations, the physical UMIs used in a process are a set of NRUMIs that are selected from a pool of candidate sequences using a greedy approach that maximizes the differences between the selected UMIs as further described hereinafter. In some implementations, NRUMIs have variable or heterogeneous molecular lengths, forming a set of vNRUMIs. In some implementations, the candidate sequence pool is filtered to remove certain sequences before being provided to select a set of UMIs used in a reaction or process.
[00170] [00170] Random UMIs provide a greater number of unique UMIs than non-random UMIs of the same sequence length. In other words, random UMIs are more likely to be unique than non-random UMIs. However, in some implementations, non-random UMIs may be easier to manufacture and have higher conversion efficiency. When non-random UMIs are combined with other information such as sequence position and virtual UMI, they can provide an efficient mechanism for indexing the source molecules of DNA fragments.
Construction of vNRUMIs
[00171] [00171] In some implementations, sequencing adapters having vNRUMIs can be prepared by a greedy approach represented in figure 1D. The process involves (a) providing a set of oligonucleotide sequences having two different molecular lengths; and (b) selecting a subset of oligonucleotide sequences from the set of oligonucleotide sequences, all edit distances between oligonucleotide sequences in the subset meeting a threshold value. The subset of oligonucleotide sequences forms a set of vNRUMIs. The method also involves (c) synthesizing a plurality of sequencing adapters, each sequencing adapter having a hybridized double-stranded region, a single-stranded 5' end, a single-stranded 3' end as depicted in Figure 2A, and at least one vNRUMI in the set of
[00172] [00172] Figure 1D illustrates a process 140 for making sequencing adapters having vNRUMIs. Process 140 begins by providing a set of oligonucleotide sequences (B) having at least two different molecular lengths. See block 141.
[00173] [00173] In various implementations, non-random UMIs are prepared considering various factors, including, but not limited to, means of detecting errors within the UMI sequences, conversion efficiency, assay compatibility, GC content, homopolymers, and considerations of manufacturing.
[00174] [00174] In some implementations, prior to operation 141, some of the oligonucleotide sequences are removed from the complete set of all possible nucleotide permutations given the specific molecular lengths of the set of vNRUMIs. For example, if the vNRUMIs have molecular lengths of six and seven nucleotides, all possible sequence permutations form a complete pool of 4^6 + 4^7 = 20480 sequences. Certain oligonucleotide sequences are removed from the pool to provide the set B of oligonucleotide sequences.
[00175] [00175] In some implementations, oligonucleotide sequences having three or more consecutive identical bases are removed from the pool to provide the set B. In some implementations, oligonucleotide sequences having a combined number of guanine and cytosine bases (G and C) of less than two are removed. In some implementations, oligonucleotide sequences having a combined number of guanine and cytosine bases of more than four are removed. In some implementations, oligonucleotide sequences having the same base in the last two positions of the sequence are removed. Sequence positions are counted starting from the end opposite to the end attached to the DNA fragments.
[00176] [00176] In some implementations, oligonucleotide sequences having a subsequence that matches the 3' end of any sequencing primers are removed.
[00177] [00177] In some implementations, oligonucleotide sequences having a thymine (T) base at the last position of the nucleotide sequences are removed. A vNRUMI attached to an A-tailed end of a processed nucleic acid fragment will result in a subsequence of a reading having the vNRUMI sequence and a T base annealed to the end of the vNRUMI sequence, the T base being complementary to the A base in the A-tail. Filtering out candidate sequences having a T base in the last position avoids confusion between such candidate sequences and the reading subsequences derived from any vNRUMIs.
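The candidate pool and several of the filters described above can be sketched in Python as follows. The function names candidate_pool and passes_filters are hypothetical, all filters are applied together although the text presents each as optional, and the primer 3' end filter is omitted because no primer sequences are given here.

```python
import re
from itertools import product


def candidate_pool(lengths=(6, 7)):
    """Enumerate every nucleotide sequence of the given lengths."""
    return ["".join(p) for n in lengths for p in product("ACGT", repeat=n)]


def passes_filters(seq: str) -> bool:
    """Apply the homopolymer, GC-count, last-two-bases, and trailing-T filters."""
    if re.search(r"(.)\1\1", seq):      # three or more consecutive identical bases
        return False
    gc = seq.count("G") + seq.count("C")
    if gc < 2 or gc > 4:                # combined G and C count outside 2..4
        return False
    if seq[-1] == seq[-2]:              # same base in the last two positions
        return False
    if seq[-1] == "T":                  # would be confused with the annealed T base
        return False
    return True


pool = candidate_pool()
print(len(pool))                 # 20480
print(passes_filters("AACTTC"))  # True
print(passes_filters("AAATCG"))  # False
```

The surviving sequences, [s for s in pool if passes_filters(s)], would play the role of the set B from which vNRUMIs are then selected.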
[00178] [00178] Process 140 proceeds by selecting an oligonucleotide sequence (S0) from B. See block 142. In some implementations, S0 can be randomly chosen from the set of oligonucleotide sequences.
[00179] [00179] Process 140 additionally involves adding S0 to an expansion set y of oligonucleotide sequences and removing S0 from set B. See block 143.
[00180] [00180] Process 140 further involves selecting an oligonucleotide sequence Si from B, where Si maximizes the distance function d(Si, y), which is the minimum edit distance between Si and any oligonucleotide sequence in the set y. See block 144. In some implementations, the edit distance is the Levenshtein distance.
[00181] [00181] In some implementations, when a sequence is shorter than the maximum length of the vNRUMIs, one or more bases are appended to the end of the sequence when calculating the Levenshtein distance or the edit distance. In some implementations, if the sequence is one base shorter than the maximum length of the vNRUMIs, a thymine (T) base is added to the end of the sequence. This T base is added to reflect a T-base overhang at the end of an adapter, complementary to the A base at the end of a DNA fragment that has undergone dA-tailing processing as described elsewhere herein. In some implementations, if the sequence is more than one base shorter than the maximum length of the vNRUMIs, a T base is added to the end of the sequence, and then one or more random bases are added after the T base to create a sequence having a molecular length equal to the maximum length of the vNRUMIs. In other words, multiple different combinations of random bases can be appended after the T base to create sequences covering all possible observed sequences. For example, if the vNRUMIs have lengths of 6 and 8, four derivations of a 6-mer can be obtained by appending TA, TC, TG, and TT.
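The padding rule can be illustrated with a small Python function. The name padded_variants is hypothetical, and enumerating every combination of bases after the T base is an assumption based on the statement that the derivations should cover all possible observed sequences.

```python
from itertools import product


def padded_variants(seq: str, max_len: int):
    """Pad a short vNRUMI to max_len: a T base first, then every combination of bases."""
    missing = max_len - len(seq)
    if missing <= 0:
        return [seq]  # already at the maximum length
    # One T for the overhang base, then (missing - 1) arbitrary bases.
    tails = ("".join(p) for p in product("ACGT", repeat=missing - 1))
    return [seq + "T" + tail for tail in tails]


print(padded_variants("AACTTC", 7))
# ['AACTTCT']
print(padded_variants("AACTTC", 8))
# ['AACTTCTA', 'AACTTCTC', 'AACTTCTG', 'AACTTCTT']
```

The second call reproduces the four derivations of a 6-mer obtained by appending TA, TC, TG, and TT.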
[00182] [00182] Process 140 proceeds to determine whether the distance function d(Si, y) meets the threshold value. In some implementations, the threshold value may require the distance function (for example, a padded Levenshtein distance) to be at least 3. If the distance function d(Si, y) meets the threshold, the process proceeds to add Si to the expansion set y and remove Si from the set B. See the “Yes” branch of decision 145 and block 146. If the distance function does not meet the threshold value, process 140 does not add Si to the expansion set y, and the process proceeds to synthesize the plurality of sequencing adapters, where each sequencing adapter has at least one vNRUMI in the expansion set y. See the “No” branch of decision 145 pointing to block 148.
[00183] [00183] After step 146, process 140 additionally involves a decision operation on whether more sequences from the set β need to be considered. If so, the process returns to block 144 to select a further oligonucleotide sequence from the set that maximizes the distance function. Several factors can be considered for
[00184] [00184] When it is decided that no more sequences need to be considered, process 140 proceeds to synthesize the plurality of sequencing adapters, where each adapter has at least one vNRUMI in the set of sequences γ. See the “No” branch of decision operation 147 pointing to operation 148. In some implementations, each sequencing adapter has a vNRUMI on one strand of the sequencing adapter. In some implementations, sequencing adapters having any of the forms illustrated in figure 2A are synthesized in operation 148. In some implementations, each sequencing adapter has only one vNRUMI. In some implementations, each adapter has a vNRUMI on each strand of the sequencing adapter. In some implementations, each sequencing adapter has a vNRUMI on each strand of the sequencing adapter in the double-stranded, hybridized region.
[00185] [00185] In some implementations, the process can be implemented by the following pseudocode.

algorithm vNRUMI_dist:
    input: set S of vNRUMI sequences, query sequence Q
    output: integer d representing the distance from Q to S
    let distances be a list of all distances found
    for each sequence s in S:
        if length(s) < maximum length of any sequence in S:
            add a "T" to s
        if length(Q) < maximum length of any sequence in S:
            add a "T" to Q
        add Levenshtein(s, Q) to distances
    return minimum value in distances

algorithm generate_vNRUMI_set:
    input: set X containing potential vNRUMI candidate sequences; integer N indicating the number of desired vNRUMIs in the set
    output: set Y containing a set of at most N vNRUMIs
    take a random element from X, add it to Y, remove it from X
    while number of sequences in Y < N:
        store vNRUMI_dist for every candidate in X against Y
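A runnable Python sketch of the greedy selection described for process 140 is given below. It is an illustration under stated assumptions, not the implementation of the disclosure: the function names are hypothetical, the maximum length is taken over the whole candidate pool, only a single T is used for padding (sufficient when lengths differ by one base), and the first element is chosen deterministically (an optional `first` argument, otherwise the first shortest candidate) instead of at random so that the result is reproducible.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def pad(seq, max_len):
    """Append one T to a sequence shorter than max_len."""
    return seq + "T" if len(seq) < max_len else seq

def vnrumi_dist(chosen, query, max_len):
    """Minimum padded edit distance from query to any chosen sequence."""
    return min(levenshtein(pad(s, max_len), pad(query, max_len))
               for s in chosen)

def generate_vnrumi_set(candidates, n, min_dist=3, first=None):
    """Greedily pick up to n sequences that stay at least min_dist apart."""
    pool = list(candidates)
    max_len = max(len(s) for s in pool)
    seed = first if first is not None else min(pool, key=len)
    chosen = [seed]
    pool.remove(seed)
    while len(chosen) < n and pool:
        # Maximize the distance function; break ties toward shorter sequences.
        best = max(pool,
                   key=lambda s: (vnrumi_dist(chosen, s, max_len), -len(s)))
        if vnrumi_dist(chosen, best, max_len) < min_dist:
            break  # no remaining candidate meets the threshold
        chosen.append(best)
        pool.remove(best)
    return chosen

pool = ["AACTTC", "AACTTCA", "AGCTTCG", "CGCTTCG", "CGCTTC"]
print(generate_vnrumi_set(pool, n=3, first="AACTTC"))
```

The loop stops early when the best remaining candidate falls below the threshold, so the returned set can contain fewer than n sequences.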
[00186] [00186] Next, a toy example is presented to illustrate how vNRUMIs can be obtained according to the process and algorithm described above. The toy example shows how vNRUMIs can be produced from a pool of five candidate sequences, which are then used to map observed sequence reads. Note that since this is a toy example in a significantly smaller sequence space than would be used or encountered in practice, not every aspect of the characteristics of vNRUMIs can be addressed.
[00187] [00187] In this toy example, the process aims to build a set of 3 vNRUMI sequences from a set of 6-mers and 7-mers (but ends up with only 2 vNRUMI sequences). For simplicity, it is assumed that the entire space of possible 6-mers and 7-mers consists of the following 5 sequences:
AACTTC
AACTTCA
AGCTTCG
CGCTTCG
CGCTTC
[00188] [00188] Note that all 5 sequences are assumed to have passed any biochemical filters that are implemented. At a very high level, this algorithm selects a subset of the input sequence pool while maximizing an edit distance (a Levenshtein distance) between the chosen sequences. It does this using a greedy approach: in each iteration it chooses a sequence that maximizes the distance function. The distance function, in this case, is the minimum edit distance between the sequence to be added and any sequence already in the set. This can be expressed mathematically as follows:
d(s, γ) = min(levenshtein(s, x) ∀ x ∈ γ)
[00189] [00189] In the following example, the set of vNRUMIs (n = 3) being constructed is denoted γ, and the set of candidate sequences is denoted β.
γ = {}
β = {AACTTC, AACTTCA, AGCTTCG, CGCTTCG, CGCTTC}
[00190] [00190] Since γ contains no sequences yet, the distance function d is undefined for each of the 5 sequences. In the case of a tie for the best choice, one of the tied candidates is always taken at random, preferring shorter sequences. Here, the example takes the 6-mer sequence AACTTC. It adds the sequence to γ and removes it from the candidate pool.
γ = {AACTTC}
β = {AACTTCA, AGCTTCG, CGCTTCG, CGCTTC}
[00191] [00191] The distance metric d(s, γ) ∀ s ∈ β is calculated.
[00192] [00192] d(AACTTCA, γ) = 1, as it takes only one edit (addition of an A) to get from the single element in γ to AACTTCA.
[00193] [00193] d(AGCTTCG, γ) = 2, as it takes two edits to go from this sequence to the sequence already in γ.
[00194] [00194] d(CGCTTCG, γ) = 3, as it takes three edits to go from this sequence to the sequence already in γ.
[00195] [00195] d(CGCTTC, γ) = 2. Because the sequence being compared is a hexamer, in some implementations a T base is appended to its end to simulate the annealing process, in which a T base complementary to the “A” tail is annealed to the adapter sequence. The rationale is that when vNRUMIs are identified later, both the first hexamer and the first heptamer of a read will be considered. Appending this T base ensures that, when looking at heptamers, the sequence is still not too close to any other vNRUMI. Comparing CGCTTCT to AACTTC, with the latter padded in the same way to AACTTCT, two edits are required.
[00196] [00196] Since the maximum of the distance function is 3, produced by the sequence CGCTTCG, and that distance meets the minimum threshold (of 3), the process adds CGCTTCG to γ and removes it from β.
[00197] [00197] Then, the process proceeds to calculate the distance metric d(s, γ) ∀ s ∈ β, since there are fewer than the desired number (3) of sequences in the vNRUMI set.
[00198] [00198] d(AACTTCA, γ) = 1. As calculated in the previous step, the edit distance between this sequence and the first vNRUMI sequence, s₁ = AACTTC, is 1. The edit distance between this sequence and the second vNRUMI sequence, s₂ = CGCTTCG, is 3. The distance function takes the minimum of all edit distances between the query sequence and any existing sequence, and min(3, 1) = 1, so the distance function is 1.
[00199] [00199] d(AGCTTCG, γ) = 1. As calculated in the previous step, the edit distance between this sequence and s₁ is 2. The edit distance between this sequence and s₂ is 1. Therefore, the distance function is the smaller of 2 and 1 (which is 1).
[00200] [00200] d(CGCTTC, γ) = 1. As before, the process appends a T to this sequence to make it CGCTTCT. The distance between the padded query and s₁ is 2, as previously determined. The distance between the padded query and s₂ is 1, so the distance function is 1.
[00201] [00201] Having calculated the distance function for all candidate sequences, none of them satisfies the invariant requirement of an edit distance of at least 3. This requirement makes it very unlikely that random mutations will turn one vNRUMI sequence into something resembling another. Therefore, this set of 2 is returned
[00202] [00202] Turning to virtual UMIs, those virtual UMIs that are defined at, or with respect to, the end positions of source DNA molecules can uniquely or nearly uniquely define individual source DNA molecules when the end positions are generally random, as with some fragmentation procedures and with naturally occurring cfDNA. When the sample contains relatively few source DNA molecules, virtual UMIs can uniquely identify individual source DNA molecules. Using a combination of two virtual UMIs, each associated with a different end of a source DNA molecule, increases the likelihood that virtual UMIs alone can uniquely identify source DNA molecules. Of course, even in situations where one or two virtual UMIs cannot uniquely identify source DNA molecules, the combination of such virtual UMIs with one or more physical UMIs can succeed.
[00203] [00203] If two reads are derived from the same DNA fragment, two subsequences having the same base pairs will also have the same relative location in the reads. Conversely, if the two reads are derived from two different DNA fragments, it is unlikely that two subsequences having the same base pairs will have exactly the same relative location in the reads. Therefore, if two or more subsequences from two or more reads have the same base pairs and the same relative location in those reads, it can be inferred that the reads are derived from the same fragment.
[00204] [00204] In some implementations, subsequences at or near the ends of a DNA fragment are used as virtual UMIs. This design choice has some practical advantages. First, the relative locations of these subsequences in the reads are easily determined, since they are at or near the beginning of the reads and the system does not need to use an offset to find the virtual UMI. Second, since the base pairs at the ends of the fragments are sequenced first, those base pairs are available even if the reads are relatively short. Furthermore, base pairs determined earlier in a long read have a lower sequencing error rate than those determined later. In other implementations, however, subsequences located away from the ends of the reads can be used as virtual UMIs, but their relative positions in the reads may need to be determined in order to infer that the reads are obtained from the same fragment.
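As a minimal sketch of this design choice, the following Python snippet takes the first bases of each read in a read pair as virtual UMIs and compares two read pairs. The function names and the default length `k = 6` are assumptions made for illustration, and the sketch assumes that adapter and physical UMI bases have already been trimmed so that each read starts at a fragment end. Matching virtual UMIs are evidence, not proof, of a common source fragment.

```python
def virtual_umis(read1, read2, k=6):
    """Take the first k bases of each paired-end read as virtual UMIs."""
    if len(read1) < k or len(read2) < k:
        raise ValueError("read shorter than the virtual UMI length")
    return read1[:k], read2[:k]

def same_virtual_umis(pair_a, pair_b, k=6):
    """True if two read pairs carry identical virtual UMIs at both ends."""
    return virtual_umis(*pair_a, k=k) == virtual_umis(*pair_b, k=k)

pair_a = ("ACGTTGCAAT", "GGATCCTTAG")
pair_b = ("ACGTTGCTAT", "GGATCCATAG")  # differs only beyond the first 6 bases
pair_c = ("TCGTTGCAAT", "GGATCCTTAG")  # differs at the very first base
print(same_virtual_umis(pair_a, pair_b))
print(same_virtual_umis(pair_a, pair_c))
```

Because the virtual UMIs sit at offset 0, no search within the read is needed.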
[00205] [00205] One or more subsequences in a read can be used as virtual UMIs. In some implementations, two subsequences, each tracked from a different end of the source DNA molecule, are used as virtual UMIs. In various implementations, virtual UMIs are about 24 base pairs or shorter in length, about 20 base pairs or shorter, about 15 base pairs or shorter, about 10 base pairs or shorter, about 9 base pairs or shorter, about 8 base pairs or shorter, about 7 base pairs or shorter, or about 6 base pairs or shorter. In some implementations, virtual UMIs have a length of about 6 to 10 base pairs. In other implementations, virtual UMIs have a length of about 6 to 24 base pairs.
Adapters
[00206] [00206] In addition to the adapter design described in example workflow 100 with reference to figure 1A above, other
[00207] [00207] Figure 2A(i) shows a standard Illumina TruSeq® dual index adapter. The adapter is partially double-stranded and is formed by annealing two oligonucleotides corresponding to the two strands. The two strands have a number of complementary base pairs (for example, 12-17 bp) that allow the two oligonucleotides to anneal at the end that is to be ligated to a dsDNA fragment. A dsDNA fragment to be ligated at both ends for paired-end reads is also referred to as an insert. Other base pairs are not complementary on the two strands, resulting in a fork-shaped adapter having two floppy overhangs. In the example of figure 2A(i), the complementary base pairs are part of the read 2 primer sequence and the read 1 primer sequence. Downstream of the read 2 primer sequence is a single-nucleotide 3'-T overhang, which is complementary to the single-nucleotide 3'-A overhang of a dsDNA fragment to be sequenced and can facilitate hybridization of the two overhangs. The read 1 primer sequence is at the 5' end of the complementary strand, to which a phosphate group is attached. The phosphate group is necessary for ligating the 5' end of the read 1 primer sequence to the 3'-A overhang of the DNA fragment. On the strand having the 5' floppy overhang (the top strand), in the 5' to 3' direction, the adapter has a P5 sequence, an i5 index sequence, and the read 2 primer sequence. On the strand having the 3' floppy overhang, in the 3' to 5' direction, the adapter has a P7' sequence, an i7 index sequence, and the read 1 primer sequence. The P5 and P7' oligonucleotides are complementary to amplification primers bound to the surface of flow cells on a
[00208] [00208] Figure 2A(ii) shows an adapter having a single physical UMI replacing the i7 index region of the standard dual index adapter shown in figure 2A(i). This adapter design mirrors that shown in the example workflow described previously in association with figure 1B. In certain embodiments, physical UMIs α and β are designed to be only on the 5' arm of the double-stranded adapters, resulting in ligation products that have only one physical UMI on each strand. In comparison, physical UMIs incorporated into both strands of the adapters result in ligation products that have two physical UMIs on each strand, doubling the time and cost to sequence the physical UMIs. However, this disclosure also embodies methods employing physical UMIs on both strands of the adapters, as depicted in figures 2A(iii)-2A(vi), which provide additional information that can be used to collapse different reads to obtain consensus sequences.
[00209] [00209] In some implementations, physical UMIs on adapters include random UMIs. In some implementations, physical UMIs on adapters include non-random UMIs.
[00210] [00210] Figure 2A(iii) shows an adapter having two physical UMIs added to the standard dual index adapter. The physical UMIs shown here can be random UMIs or non-random UMIs. The first physical UMI is upstream of the i7 index sequence, and the second physical UMI is upstream of the i5 index sequence. Figure 2A(iv) shows an adapter also having two physical UMIs added to the standard dual index adapter. The first physical UMI is downstream of the i7 index sequence, and the second physical UMI is downstream of the i5 index sequence. Similarly, the two physical UMIs can be random or non-random UMIs.
[00211] [00211] An adapter having two physical UMIs on the two arms of the single-stranded region, such as those shown in figures 2A(iii) and 2A(iv), can link the two strands of a double-stranded DNA fragment if a priori or a posteriori information associating the two non-complementary physical UMIs is known. For example, a researcher can know the sequences of UMI 1 and UMI 2 before integrating them into the same adapter in the design shown in figure 2A(iv). This association information can be used to infer that reads having UMI 1 and UMI 2 are derived from the two strands of the DNA fragment to which the adapter was ligated. Therefore, one can collapse not only reads having the same physical UMI, but also reads having either of the two non-complementary physical UMIs. Interestingly, and as discussed below, a phenomenon referred to as “UMI jumping” can complicate the inference of association between physical UMIs in single-stranded regions of adapters.
[00212] [00212] The two physical UMIs on the two strands of the adapters in figure 2A(iii) and figure 2A(iv) are neither located at the same site nor complementary to each other. However, this disclosure also embodies methods employing physical UMIs that are at the same site on the two adapter strands and/or complementary to each other. Figure 2A(v) shows a duplex adapter in which the two physical UMIs are complementary in a double-stranded region at or near the end of the adapter. The two physical UMIs can be random UMIs or non-random UMIs. Figure 2A(vi) shows an adapter similar to, but shorter than, that of figure 2A(v); it does not include the index sequences or the P5 and P7' sequences complementary to flow cell surface amplification primers. Similarly, the two physical UMIs can be random UMIs or non-random UMIs.
[00213] [00213] Compared to adapters having one or more single-stranded physical UMIs on single-stranded arms, adapters having a UMI
[00214] [00214] In some embodiments, it may be advantageous to employ relatively short physical UMIs, since short physical UMIs are easier to incorporate into adapters. Furthermore, shorter physical UMIs are faster and easier to sequence in the amplified fragments. However, as physical UMIs become very short, the total number of different physical UMIs may become smaller than the number of adapter molecules required for sample processing. In order to provide enough adapters, the same UMI would have to be repeated on two or more adapter molecules. In such a scenario, adapters having the same physical UMI can be ligated to multiple source DNA molecules. However, these short physical UMIs can provide enough information, when combined with other information such as virtual UMIs and/or read alignment locations, to uniquely identify reads as being derived from a particular source polynucleotide or DNA fragment.
[00215] [00215] Moreover, in some implementations, read collapsing is based on two physical UMIs at the two ends of an insert. In such implementations, two very short physical UMIs (for example, 4 bp each) are combined to determine the source of DNA fragments, the combined length of the two physical UMIs providing enough information to distinguish between different fragments.
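The arithmetic behind this trade-off can be checked with a short Python sketch. The function name `umi_space` is hypothetical, and the count assumes fully random UMIs over the four bases; a designed non-random set would be smaller.

```python
def umi_space(length, alphabet_size=4):
    """Number of distinct random UMIs of the given length."""
    return alphabet_size ** length

single = umi_space(4)             # one 4 bp UMI
combined = single * umi_space(4)  # two independent 4 bp UMIs, one per end
print(single)    # 256
print(combined)  # 65536
```

A single 4 bp UMI distinguishes at most 256 molecules, whereas the pair distinguishes up to 65,536 combinations, the same as one 8 bp UMI.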
[00216] [00216] In various implementations, physical UMIs are about 12 base pairs or shorter in length, about 11 base pairs or shorter, about 10 base pairs or shorter, about 9 base pairs or shorter, about 8 base pairs or shorter, about 7 base pairs or shorter, about 6 base pairs or shorter, about 5 base pairs or shorter, about 4 base pairs or shorter, or about 3 base pairs or shorter. In some implementations where physical UMIs are non-random UMIs, the UMIs are about 12 base pairs or shorter in length, about 11 base pairs or shorter, about 10 base pairs or shorter, about 9 base pairs or shorter, about 8 base pairs or shorter, about 7 base pairs or shorter, or about 6 base pairs.
[00217] [00217] UMI jumping can affect the inference of association between physical UMIs on one arm or both arms of adapters, as in the adapters of figures 2A(ii)-(iv). It has been observed that when these adapters are applied to DNA fragments, amplification products may include a greater number of fragments having unique physical UMIs than the actual number of fragments in the sample.
[00218] [00218] Furthermore, when adapters having physical UMIs on both arms are applied, amplified fragments having a common physical UMI at one end should have another common physical UMI at the other end. However, sometimes this is not the case. For example, in the product of an amplification reaction, some fragments may have a first physical UMI and a second physical UMI at their two ends; other fragments may have the second physical UMI and a third physical UMI; still other fragments may have the first physical UMI and the third physical UMI; yet other fragments may have the third physical UMI and a fourth physical UMI, and so on. In this example, the source fragment(s) for these amplified fragments may be difficult to determine. Apparently, during the amplification process, a physical UMI may have been replaced by another physical UMI.
[00219] [00219] One possible approach to this UMI jumping problem considers only fragments sharing both UMIs as derived from the same source molecule, while fragments sharing only one UMI are excluded from the analysis. However, some of the fragments sharing only one physical UMI may in fact derive from the same molecule as those sharing both physical UMIs. By excluding fragments sharing only one physical UMI from consideration, useful information can be lost. Another possible approach considers any fragments having one common physical UMI as derived from the same source molecule. But this approach does not allow the two physical UMIs at the two ends of the fragments to be combined for downstream analysis. Furthermore, under either approach, for the previous example, fragments sharing the first and second physical UMIs would not be considered derived from the same source molecule as fragments sharing the third and fourth physical UMIs. This may or may not be true. A third approach can address the UMI jumping problem by using adapters with physical UMIs on both strands in the double-stranded region, such as the adapters in figures 2A(v)-(vi). A hypothetical mechanism underlying UMI jumping is explained further below.
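The first two approaches can be contrasted with a small Python sketch applied to the four-UMI example above. The representation of a fragment as a pair of UMI labels and the function names are assumptions made for illustration.

```python
def share_both(frag_a, frag_b):
    """First approach: same source only if both physical UMIs match."""
    return set(frag_a) == set(frag_b)

def share_any(frag_a, frag_b):
    """Second approach: same source if at least one physical UMI matches."""
    return bool(set(frag_a) & set(frag_b))

# Fragments from the example, labeled by the UMIs at their two ends.
f12, f23, f13, f34 = ("U1", "U2"), ("U2", "U3"), ("U1", "U3"), ("U3", "U4")

print(share_both(f12, f23), share_any(f12, f23))
print(share_both(f12, f34), share_any(f12, f34))
```

Fragments carrying the first and second UMIs are linked to those carrying the second and third only under the second rule, and neither rule links them directly to fragments carrying the third and fourth UMIs.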
[00220] [00220] Figure 2B illustrates a hypothetical process in which UMI jumping occurs in a PCR reaction involving adapters having physical UMIs on both strands in the single-stranded arms. The two physical UMIs can be random UMIs or non-random UMIs. The actual underlying mechanism of UMI jumping and the hypothetical process described here do not affect the utility of the adapters and methods described herein. The PCR reaction begins by providing at least one double-stranded source DNA fragment 202 and adapters 204 and 206. Adapters 204 and 206 are similar to the adapters illustrated in figures 2A(iii)-(iv). Adapter 204 has a P5 adapter sequence and a physical UMI α1 on its 5' arm. Adapter 204 also has a P7' adapter sequence and a physical UMI α2 on its 3' arm. Adapter 206 has a P5 adapter sequence and a physical UMI β2 on its 5' arm, and a P7' adapter sequence and a physical UMI β1 on its 3' arm. The process proceeds by ligating adapter 204 and adapter 206 to fragment 202, obtaining ligation product 208. The process proceeds by denaturing ligation product 208, resulting in a denatured single-stranded fragment 212. Meanwhile, the reaction mixture often includes residual adapters at this stage. Even if the process has already involved removing overabundant adapters, such as by using Solid Phase Reversible Immobilization (SPRI) beads, some adapters are still left in the reaction mixture. One such remaining adapter is illustrated as adapter 210, which is similar to adapter 206, except that adapter 210 has physical UMIs γ1 and γ2 on its 3' and 5' arms, respectively. The denaturing condition that produces the denatured fragment 212 also produces a denatured adapter oligonucleotide 214, which has physical UMI γ2 next to its P5 adapter sequence.
[00221] [00221] The single-stranded adapter fragment 214 is then hybridized to the single-stranded DNA fragment 212, and a PCR process extends the single-stranded adapter fragment 214 to produce an intermediate insert 216 that is complementary to DNA fragment 212. During the various PCR amplification cycles, intermediate adapter fragments 218, 220, and 222 may result from PCR extensions of the P7' strands of adapters including different physical UMIs δ, ε, and ζ. The intermediate adapter fragments 218, 220, and 222 all have the P7' sequence at the 5' end, and respectively have physical UMIs δ, ε, and ζ. In successive PCR cycles, intermediate adapter fragments 218, 220, and 222 can hybridize to intermediate fragment 216 or its amplicons, since the 3' ends of intermediate adapter fragments 218, 220, and 222 are complementary to region 217 of intermediate insert 216. PCR extension of the hybridized fragments produces single-stranded DNA fragments 224, 226, and 228. DNA fragments 224, 226, and 228 are labeled with three different physical UMIs (δ, ε, and ζ) at the 5' end, and a physical UMI γ2 at the 3' end, indicating “UMI jumping”, in which different UMIs are attached to nucleotide sequences derived from the same DNA fragment 202.
[00222] [00222] In some implementations of the disclosure, using adapters having physical UMIs on both strands in the double-stranded region of the adapters, such as the adapters in figures 2A(v)-(vi), can prevent or reduce UMI jumping. This may be because the physical UMIs on one adapter in the double-stranded region are different from the physical UMIs on all other adapters. This helps to reduce the complementarity between intermediate adapter oligonucleotides and intermediate fragments,
[00223] [00223] In various implementations using UMIs, multiple sequence reads having the same UMI(s) are collapsed to obtain one or more consensus sequences, which are then used to determine the sequence of a source DNA molecule. Multiple different reads can be generated from different instances of the same source DNA molecule, and these reads can be compared to produce a consensus sequence as described herein. The instances can be generated by amplifying the source DNA molecule before sequencing, so that different sequencing operations are performed on different amplification products, each sharing the sequence of the source DNA molecule. Of course, amplification can introduce errors, so that the sequences of the different amplification products differ. In the context of some sequencing technologies, such as Illumina's sequencing by synthesis, the source DNA molecule or an amplification product thereof forms a cluster of DNA molecules bound to a region of a flow cell. The molecules of the cluster collectively provide a read. Typically, at least two reads are required to provide a consensus sequence. Sequencing depths of 100, 1,000, and 10,000 are examples of sequencing depths useful in the described embodiments for creating consensus reads for low allele frequencies (for example, about 1% or less).
[00224] [00224] In some implementations, nucleotides that are consistent across 100% of the reads sharing a UMI or combination of UMIs are included in the consensus sequence. In other implementations, the consensus criterion can be lower than 100%. For example, a 90% consensus criterion can be used, which means that base pairs that exist in 90% or more of the reads in the group are included in the consensus sequence. In various implementations, the consensus criterion can be set at about 30%, about 40%, about 50%, about 60%, about 70%, about 80%, about 90%, about 95%, or about 100%.
Collapsing by physical UMIs and virtual UMIs
[00225] [00225] Multiple techniques can be used to collapse reads that include multiple UMIs. In some implementations, reads sharing a common physical UMI can be collapsed to obtain a consensus sequence. In some implementations, if the common physical UMI is a random UMI, the random UMI may be unique enough to identify a particular source molecule of a DNA fragment in a sample. In other implementations, if the common physical UMI is a non-random UMI, the UMI may not be unique enough by itself to identify a particular source molecule. In either case, a physical UMI can be combined with a virtual UMI to provide an index of the source molecule.
[00226] [00226] In the example workflow described above and depicted in figures 1B, 3A, and 4, some reads include α-ρ-φ UMIs, while others include β-φ-ρ UMIs. The physical UMI α produces reads having α. If all adapters used in a workflow have different physical UMIs (for example, different random UMIs), all reads having α in the adapter region are likely to be derived from the same strand of the DNA fragment. Similarly, the physical UMI β produces reads having β, all of which are derived from the same complementary strand of the DNA fragment. It is therefore useful to collapse all reads including α to obtain a consensus sequence, and to collapse all reads including β to obtain another consensus sequence. This is illustrated as the first-level collapsing in figures 4B-4C. Since all reads in a group are derived from the same source polynucleotide in a sample, base pairs included in the consensus sequence are likely to reflect the true sequence of the source polynucleotide, while a base pair excluded from the consensus sequence probably reflects variation or error introduced in the workflow.
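A first-level collapse of this kind can be sketched in Python as follows. This is an illustrative sketch, not the implementation of the disclosure: the function names are hypothetical, reads are assumed to be already aligned and of equal length within a family, and positions that fail the consensus criterion are marked "N" (one possible way to represent an excluded base).

```python
from collections import Counter, defaultdict

def consensus(reads, min_fraction=0.9):
    """Per-position consensus; positions below the criterion become 'N'."""
    out = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(column) >= min_fraction else "N")
    return "".join(out)

def collapse_by_physical_umi(tagged_reads, min_fraction=0.9):
    """Group (umi, sequence) pairs by physical UMI and collapse each family."""
    families = defaultdict(list)
    for umi, seq in tagged_reads:
        families[umi].append(seq)
    return {umi: consensus(seqs, min_fraction)
            for umi, seqs in families.items()}

reads = [("alpha", "ACGT"), ("alpha", "ACGT"), ("alpha", "ACGA"),
         ("beta", "TTGA"), ("beta", "TTGA")]
print(collapse_by_physical_umi(reads, min_fraction=0.6))
print(collapse_by_physical_umi(reads, min_fraction=1.0))
```

With a 60% criterion the lone discordant base in the alpha family is outvoted, whereas a 100% criterion leaves that position undetermined.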
[00227] [00227] In addition, virtual UMIs ρ and φ can provide information to determine that reads including one or both virtual UMIs are derived from the same source DNA fragment. Since virtual UMIs ρ and φ are internal to the source DNA fragments, exploiting the virtual UMIs does not add overhead to preparation or sequencing in practice. After the sequences of the reads are obtained, one or more subsequences in the reads can be determined as virtual UMIs. If the virtual UMIs include enough base pairs and have the same relative location in the reads, they can uniquely identify the reads as having been derived from the source DNA fragment. Therefore, reads having one or both virtual UMIs ρ and φ can be collapsed to obtain a consensus sequence. The combination of virtual UMIs and physical UMIs can provide information for a second-level collapsing when only one physical UMI is assigned to a first-level consensus sequence for each strand, as shown in figure 3A and figures 4A-4C. However, in some implementations, this second-level collapsing using virtual UMIs can be difficult if there is an overabundance of DNA molecules or if the fragmentation is not random.
[00228] [00228] In alternative embodiments, reads having two physical UMIs at the two ends, such as those shown in figure 3B and figures 4D and 4E, can be collapsed in a second-level collapsing based on a combination of the physical UMIs and the virtual UMIs. This is especially useful when the physical UMIs are too short to uniquely identify source DNA fragments without using virtual UMIs. In these embodiments, second-level collapsing can be implemented, with duplex physical UMIs as shown in figure 3B, by collapsing α-ρ-φ-β consensus reads and β-φ-ρ-α consensus reads from the same DNA molecule, thereby obtaining a consensus sequence that includes the nucleotides consistent among all of the reads.
[00229] [00229] Using UMIs and the collapsing scheme described here, various embodiments can suppress different sources of error affecting the determined sequence of a fragment, even if the fragment includes alleles with very low allele frequencies. Reads sharing the same UMIs (physical and/or virtual) are grouped together. By collapsing the grouped reads, variants (SNVs and small indels) due to PCR, library preparation, clustering, and sequencing errors can be eliminated. Figures 4A-4E illustrate how a method as described in an example workflow can suppress different sources of error in determining the sequence of a double-stranded DNA fragment. The illustrated reads include α-ρ-φ or β-φ-ρ UMIs in figures 3A and 4A-4C, and α-ρ-φ-β or β-φ-ρ-α UMIs in figures 3B, 4D and 4E. UMIs α and β are singleplex physical UMIs in figures 3A and 4A-4C. UMIs α and β are duplex UMIs in figures 3B, 4D and 4E. The virtual UMIs ρ and φ are located at the ends of a DNA fragment.
[00230] [00230] The method using singleplex physical UMIs as shown in figures 4A-4C first involves collapsing the reads having the same physical UMI α or β, illustrated as first-level collapsing. The first-level collapsing obtains a consensus sequence α for reads having the physical UMI α, which are derived from one strand of the double-stranded fragment. The first-level collapsing also obtains a consensus sequence β for reads having the physical UMI β, which are derived from the other strand of the double-stranded fragment. In a second-level collapsing, the method obtains a third consensus sequence from consensus sequence α and consensus sequence β. The third consensus sequence reflects consensus base pairs from reads having the same duplex virtual UMIs ρ and φ, which are derived from the two complementary strands of the source fragment. Finally, the sequence of the double-stranded DNA fragment is determined to be the third consensus sequence.
[00231] [00231] The method using duplex physical UMIs as shown in figures 4D-4E first involves collapsing the reads having physical UMIs α and β in the order α→β in the 5'-3' direction, illustrated as first-level collapsing. The first-level collapsing obtains a consensus sequence α-β for reads having the physical UMIs α and β, which are derived from a first strand of the double-stranded fragment. The first-level collapsing also obtains a consensus sequence β-α for reads having the physical UMIs β and α in the order β→α in the 5'-3' direction, which are derived from a second strand complementary to the first strand of the double-stranded fragment. In a second-level collapsing, the method obtains a third consensus sequence from consensus sequence α-β and consensus sequence β-α. The third consensus sequence reflects consensus base pairs from reads having the same duplex virtual UMIs ρ and φ, which are derived from the two strands of the fragment. Finally, the sequence of the double-stranded DNA fragment is determined to be the third consensus sequence.
[00232] Figure 4A illustrates how a first level collapse can suppress sequencing errors. Sequencing errors occur on the sequencing platform after sample and library preparation (for example, PCR amplification). Sequencing errors can introduce different erroneous bases in different readings. True positive bases are illustrated by solid letters, while false positive bases are illustrated by hatched letters. False positive nucleotides in different readings of the α-ρ-φ family are excluded from the α consensus sequence. The true positive nucleotide "A" illustrated at the left end of the readings of the α-ρ-φ family is retained for the α consensus sequence. Similarly,
[00233] PCR errors occur before cluster amplification. Therefore, an erroneous base pair introduced into a single stranded DNA by the PCR process can be amplified during cluster amplification, so that it appears in multiple clusters and readings. As illustrated in figure 4B and figure 4D, a false positive base pair introduced by PCR error can appear in many readings. The “T” base in the readings of the α-ρ-φ family (figure 4B) or α-β family (figure 4D) and the “C” base in the readings of the β-φ-ρ family (figure 4B) or β-α family (figure 4D) are such PCR errors. In contrast, the sequencing errors shown in figure 4A appear in one or a few readings in the same family. Since PCR errors appear in many readings in the family, a first level collapse of readings from one strand does not remove PCR errors, although the first level collapse removes sequencing errors (for example, G and A removed from the α-ρ-φ family in figure 4B and the α-β family in figure 4D). However, since a PCR error is introduced into a single stranded DNA, the complementary strand of the source fragment and the readings derived from it usually do not have the same PCR error. Therefore, the second level collapse based on readings from the two strands of the source fragment can effectively remove PCR errors, as shown at the bottom of figures 4B and 4D.
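As a rough illustration of why the two collapse levels remove different error types, the following Python sketch applies a per-position majority vote within each strand family and then compares the two strand consensuses. The function names, the assumption that readings are equal in length and already oriented to the same strand, the strict-majority rule, and the choice to mask strand disagreements with "N" are all assumptions made for this example; the description does not specify how votes or disagreements are resolved.

```python
from collections import Counter


def first_level_collapse(readings):
    """Majority vote per position within one strand family (hypothetical helper).

    Positions without a strict majority are masked as "N".
    """
    consensus = []
    for column in zip(*readings):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count * 2 > len(column) else "N")
    return "".join(consensus)


def second_level_collapse(strand_a, strand_b):
    """Keep a base only where the two strand consensuses agree (assumed rule)."""
    return "".join(a if a == b else "N" for a, b in zip(strand_a, strand_b))


# Family from one strand: a PCR error ("T" at index 2) is present in every
# reading, and one reading also carries an isolated sequencing error at index 0.
family_one = ["ACTTAC", "ACTTAC", "GCTTAC"]
# Family from the complementary strand: no PCR error, one sequencing error.
family_two = ["ACGTAC", "ACGTAC", "ACGTAA"]

consensus_one = first_level_collapse(family_one)
consensus_two = first_level_collapse(family_two)
duplex = second_level_collapse(consensus_one, consensus_two)
```

In this toy input the isolated sequencing errors are voted out within each family, while the shared PCR error survives the first level and is only exposed when the two strand consensuses are compared.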
[00234] On some sequencing platforms, homopolymer errors occur, introducing small indel errors into homopolymers of repeated single nucleotides. Figures 4C and 4E illustrate homopolymer error correction using the methods described here. In the readings of the α-ρ-φ family (figure 4C) or the α-ρ-φ-β family (figure 4E), two “T” nucleotides were deleted from the second reading from the top, and one “T” nucleotide was deleted from the third reading from the top. In the readings of the β-φ-ρ family (figure 4C) or the β-φ-ρ-α family (figure 4E), a “T” nucleotide was inserted in the first reading from the top. Similar to the sequencing errors illustrated in figure 4A, homopolymer errors occur after PCR amplification, so different readings have different homopolymer errors. As a result, the first level collapse can effectively remove indel errors.
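One simple way to picture why per-family voting removes reading-specific homopolymer indels is to vote on the length of each homopolymer run. The sketch below is an assumption-laden illustration rather than the described method: it assumes the readings of a family differ only in run lengths (no run is deleted entirely), and the helper names are hypothetical.

```python
from collections import Counter
from itertools import groupby


def run_length_encode(sequence):
    """Return [(base, run_length), ...] for consecutive identical bases."""
    return [(base, sum(1 for _ in run)) for base, run in groupby(sequence)]


def homopolymer_consensus(readings):
    """Vote on each run length across the readings of one family."""
    encoded = [run_length_encode(r) for r in readings]
    orders = [tuple(base for base, _ in runs) for runs in encoded]
    # Keep only readings that share the most common order of bases.
    common_order = Counter(orders).most_common(1)[0][0]
    kept = [runs for runs, order in zip(encoded, orders) if order == common_order]
    consensus = []
    for i, base in enumerate(common_order):
        length_votes = Counter(runs[i][1] for runs in kept)
        consensus.append(base * length_votes.most_common(1)[0][0])
    return "".join(consensus)


# Five readings of a run of five "T": one lost two bases, one lost one base.
family = ["GATTTTTC", "GATTTC", "GATTTTC", "GATTTTTC", "GATTTTTC"]
corrected = homopolymer_consensus(family)
```

Because each reading carries a different indel, the correct run length remains the most frequent one within the family.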
[00235] Consensus sequences can be obtained by collapsing the readings having one or more common non-random UMIs and one or more common virtual UMIs. In addition, position information can also be used to obtain consensus sequences, as described below.
Collapsing by Position
[00236] In some implementations, readings are processed to align to a reference sequence to determine the alignment locations of the readings on the reference sequence (localization). However, in some implementations not previously illustrated, localization is achieved by analyzing k-mer similarity and reading-by-reading alignment. This second implementation has two advantages: first, it can collapse (error-correct) readings that do not match the reference, due to differences in haplotypes or translocations, and second, it does not depend on an aligner algorithm, thus removing the possibility of aligner-induced artifacts (errors in the aligner). In some implementations, readings sharing the same localization information can be collapsed to obtain consensus sequences to determine the sequence of the source DNA fragments. In some contexts, the alignment process is also referred to as a mapping process. Sequence readings are subjected to an alignment process to be mapped to a reference sequence. Various alignment tools and algorithms can be used to align readings to the reference sequence as described elsewhere in the description. As usual in alignment algorithms, some readings are successfully aligned to the reference sequence, while others may not be aligned successfully or may be poorly aligned to the reference sequence. Readings that are successfully aligned to the reference sequence are associated with sites in the reference sequence. Aligned readings and their associated sites are also referred to as sequence markers. Some sequence readings that contain a large number of repeats tend to be more difficult to align to the reference sequence. When a reading is aligned to a reference sequence with a number of mismatched bases above a certain criterion, the reading is considered to be poorly aligned. In various embodiments, readings are considered to be poorly aligned when they are aligned with at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 mismatches. In other embodiments, readings are considered to be poorly aligned when they are aligned with at least about 5% mismatches. In other embodiments, readings are considered to be poorly aligned when they are aligned with at least about 10%, 15%, or 20% of mismatched bases.
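The mismatch criteria above can be expressed as a small predicate. This is a minimal sketch under stated assumptions: the function name and parameters are hypothetical, "at least about N" is read as a greater-than-or-equal comparison, and the percentage is taken relative to the reading length.

```python
def is_poorly_aligned(mismatches, reading_length,
                      count_threshold=None, percent_threshold=None):
    """Flag an aligned reading by absolute or relative mismatch count.

    count_threshold: flag when mismatches >= this number (e.g. 1 to 10).
    percent_threshold: flag when mismatches are >= this percentage of the
    reading length (e.g. 5, 10, 15, or 20).
    """
    if reading_length <= 0:
        raise ValueError("reading_length must be positive")
    if count_threshold is None and percent_threshold is None:
        raise ValueError("provide at least one threshold")
    if count_threshold is not None and mismatches >= count_threshold:
        return True
    # Integer arithmetic avoids floating-point rounding at the boundary.
    if (percent_threshold is not None
            and mismatches * 100 >= percent_threshold * reading_length):
        return True
    return False
```

For example, with a 5% criterion a 100bp reading with 5 mismatches is flagged, while one with 4 mismatches is not.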
[00237] In some implementations, the methods described combine position information with physical UMI information to index the source molecules of DNA fragments. Sequence readings sharing the same reading position and the same non-random or random physical UMI can be collapsed to obtain a consensus sequence to determine the sequence of a fragment or portion thereof. In some implementations, sequence readings sharing the same reading position, the same non-random physical UMI, and a random physical UMI can be collapsed to obtain a consensus sequence. In such implementations, the adapter can include both a non-random physical UMI and a random physical UMI. In some implementations, sequence readings sharing the same reading position and the same virtual UMI can be collapsed to obtain a consensus sequence.
[00238] Reading position information can be obtained through different techniques. For example, in some implementations, genomic coordinates can be used to provide reading position information. In some implementations, the position in a reference sequence to which a reading is aligned can be used to provide reading position information. For example, the start and stop positions of a reading on a chromosome can be used to provide reading position information. In some implementations, reading positions are considered to be the same if they have identical position information. In some implementations, reading positions are considered to be the same if the difference between the position information is less than a defined criterion. For example, two readings having genomic start positions that differ by less than 2, 3, 4, or 5 base pairs can be considered readings having the same reading position. In other implementations, reading positions are considered to be the same if their position information can be converted to, and matched in, a particular position space. A reference sequence can be provided before sequencing (for example, it can be a well-known and widely used human genomic sequence) or it can be determined from readings obtained during sample sequencing.
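A minimal sketch of grouping readings by UMI and approximate start position before collapsing is shown below. The tuple layout, the function name, and the greedy rule that compares each start position with the first start position of the current family are assumptions for illustration; the description only states that positions differing by less than a defined criterion can be treated as the same.

```python
from collections import defaultdict


def group_by_umi_and_position(readings, max_diff=2):
    """Group readings that share a UMI and nearly the same start position.

    readings: iterable of (umi, chromosome, start, sequence) tuples.
    max_diff: starts differing by less than this many base pairs from the
    family's first start are treated as the same reading position.
    Returns {(umi, chromosome, anchor_start): [sequence, ...]}.
    """
    per_umi = defaultdict(list)
    for umi, chromosome, start, sequence in readings:
        per_umi[(umi, chromosome)].append((start, sequence))

    families = {}
    for (umi, chromosome), items in per_umi.items():
        items.sort()
        anchor = None
        for start, sequence in items:
            if anchor is None or start - anchor >= max_diff:
                anchor = start  # open a new family at this position
            families.setdefault((umi, chromosome, anchor), []).append(sequence)
    return families


example = [
    ("AAC", "chr1", 100, "r1"),
    ("AAC", "chr1", 101, "r2"),
    ("AAC", "chr1", 105, "r3"),
    ("GGT", "chr1", 100, "r4"),
]
families = group_by_umi_and_position(example, max_diff=2)
```

Each resulting family could then be passed to a consensus step; readings with the same UMI but distant positions, or the same position but different UMIs, stay in separate families.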
[00239] Regardless of the specific sequencing protocol and platform, at least a portion of the nucleic acids contained in the sample are sequenced to generate tens of thousands, hundreds of thousands, or millions of sequence readings, for example, 100bp readings. In some embodiments, the sequence readings comprise about 20bp, about 25bp, about 30bp, about 35bp, about 36bp, about 40bp, about 45bp, about 50bp, about 55bp, about 60bp, about 65bp, about 70bp, about 75bp, about 80bp, about 85bp, about 90bp, about 95bp, about 100bp, about 110bp, about 120bp, about 130bp, about 140bp, about 150bp, about 200bp, about 250bp, about 300bp, about 350bp, about 400bp, about 450bp, about 500bp, about 800bp, about 1000bp, or about 2000bp.
[00240] In some embodiments, readings are aligned to a reference genome, for example, hg19. In other embodiments, readings are aligned to a portion of a reference genome, for example, a chromosome or chromosome segment. Readings that are uniquely mapped to the reference genome are known as sequence markers. In one embodiment, at least about 3 x 10^6 qualified sequence markers, at least about 5 x 10^6 qualified sequence markers, at least about 8 x 10^6 qualified sequence markers, at least about 10 x 10^6 qualified sequence markers, at least about 15 x 10^6 qualified sequence markers, at least about 20 x 10^6 qualified sequence markers, at least about 30 x 10^6 qualified sequence markers, at least about 40 x 10^6 qualified sequence markers, or at least about 50 x 10^6 qualified sequence markers are obtained from readings that map uniquely to a reference genome.
Applications
[00241] In several applications, error correction strategies as described here can provide one or more of the following benefits: (i) detecting somatic mutations of very low allele frequency, (ii) decreasing cycle time by mitigating phasing/pre-phasing errors, and/or (iii) increasing reading length by boosting base call quality in the later part of readings, etc. The applications and rationale regarding the detection of somatic mutations of low allele frequency are discussed above.
[00242] [00242] In certain modalities, the techniques described here may allow reliable calling of alleles having frequencies of about 2% or less, or about 1% or less, or about 0.5% or less. Such low frequencies are common in cfDNA originating from tumor cells in a cancer patient. In some modalities, the techniques described here may allow the identification of rare strains in metagenomic samples, as well as the detection of rare variants in viral or other populations when, for example, a patient has been infected with multiple viral strains, and / or has undergone medical treatment.
[00243] [00243] In certain embodiments, the techniques described here may allow for shorter sequencing chemistry cycle time. The shortened cycle time increases sequencing errors, which can be corrected using the method described above.
[00244] [00244] In some implementations involving UMIs, long readings can be obtained from paired end sequencing using asymmetric read lengths for a pair of paired end (PE) readings from two ends of a segment. For example, a pair of readings having 50 bp in one paired end reading and 500 bp in another paired end reading can be "stitched" together with another pair of readings to produce a long 1000 bp reading. These implementations can provide faster sequencing speed to determine long
[00245] Figure 5 schematically illustrates an example of efficiently obtaining long paired end readings in this type of application by applying physical UMIs and virtual UMIs. Libraries from both strands of the same DNA fragments are clustered on the flow cell. The library insert size is longer than 1 kb. Sequencing is performed with asymmetric reading lengths (for example, Reading 1 = 500 bp, Reading 2 = 50 bp), to ensure the quality of the long 500 bp readings. By stitching the two strands, PE readings of 1000 bp in length can be created with only 500 + 50 bp of sequencing.
Samples
[00246] Samples that are used to determine DNA fragment sequences can include samples taken from any cell, fluid, tissue, or organ including nucleic acids in which sequences of interest are to be determined. In some embodiments involving the diagnosis of cancers, circulating tumor DNA can be obtained from a subject's body fluid, for example blood or plasma. In some embodiments involving fetal diagnosis, it is advantageous to obtain cell-free nucleic acids, for example, cell-free DNA (cfDNA), from maternal body fluid. Cell-free nucleic acids, including cell-free DNA, can be obtained by a variety of methods known in the art from biological samples including, but not limited to, plasma, serum, and urine (see, for example, Fan et al., Proc Natl Acad Sci 105: 16266-16271 [2008]; Koide et al., Prenatal Diagnosis 25: 604-607 [2005]; Chen et al., Nature Med. 2: 1033-1035 [1996]; Lo et al., Lancet 350: 485-487 [1997]; Botezatu et al., Clin Chem. 46: 1078-1084 [2000]; and Su et al., J Mol. Diagn. 6: 101-107 [2004]).
[00247] [00247] In various modalities the nucleic acids (for example, DNA or RNA) present in the sample can be specifically enriched
[00248] The sample comprising the nucleic acids to which the methods described here are applied typically includes a biological sample ("test sample") as described above. In some embodiments, the nucleic acids to be sequenced are purified or isolated by any of a number of well-known methods.
[00249] Consequently, in certain embodiments, the sample includes or consists essentially of a purified or isolated polynucleotide, or may include samples such as a tissue sample, a biological fluid sample, a cell sample, and the like. Suitable biological fluid samples include, but are not limited to, blood, plasma, serum, sweat, tears, sputum, urine, sputum, atrial flow, lymph, saliva, cerebrospinal fluid, damage, bone marrow suspension, vaginal flow, transcervical lavage, brain fluid, ascites, milk, secretions from the respiratory, intestinal and genitourinary tracts, amniotic fluid, milk, and leukophoresis. In some modalities, the sample is a sample that is easily obtainable by non-invasive procedures, for example, blood, plasma, serum, sweat, tears, sputum, urine, excrement, sputum, auricular flow, saliva or feces. In certain embodiments, the sample is a peripheral blood sample, or the plasma and / or serum fractions of a sample of
[00250] In certain embodiments, samples may be obtained from sources including, but not limited to, samples from different individuals, samples from different stages of development of the same or different individuals, samples from different diseased individuals (for example, individuals suspected of having a genetic disorder), samples from normal individuals, samples obtained at different stages of a disease in an individual, samples obtained from an individual undergoing different treatments for a disease, samples from individuals subjected to different environmental factors, samples from individuals with a predisposition to a pathology, samples from individuals with exposure to an infectious disease agent, and the like.
[00251] [00251] In an illustrative but not limiting modality, the sample is a maternal sample that is obtained from a pregnant female, for example a pregnant woman. In this case, the sample can be analyzed using the methods described here to provide a prenatal diagnosis of potential chromosomal abnormalities in the fetus. The maternal sample can be a tissue sample, a biological fluid sample, or a cell sample. A biological fluid includes, as non-limiting examples, samples of blood, plasma, serum, sweat, tears, sputum, urine, sputum, atrial flow, lymph, saliva, cerebrospinal fluid, damage, bone marrow suspension, vaginal flow, transcervical lavage , cerebral fluid, ascites, milk, secretions from the respiratory, intestinal and genitourinary tracts, and leukophoresis.
[00252] In certain embodiments, samples can also be obtained from in vitro cultured tissues, cells, or other polynucleotide-containing sources. Cultured samples can be taken from sources including, but not limited to, cultures (for example, tissue or cells) maintained in different media and conditions (for example, pH, pressure, or temperature), cultures (for example, tissue or cells) maintained for different periods of time, cultures (for example, tissue or cells) treated with different factors or reagents (for example, a candidate drug, or a modulator), or cultures of different types of tissue and/or cells.
[00253] Methods of isolating nucleic acids from biological sources are well known and will differ depending on the nature of the source. One skilled in the art can readily isolate nucleic acids from a source as necessary for the method described here. In some cases, it may be advantageous to fragment the nucleic acid molecules in the nucleic acid sample. Fragmentation can be random, or it can be specific, as achieved, for example, using restriction endonuclease digestion. Methods for random fragmentation are well known in the art, and include, for example, limited DNAse digestion, alkaline treatment and physical shearing.
Sequencing library preparation
[00254] In several embodiments, sequencing can be carried out on various sequencing platforms that require preparation of a sequencing library. The preparation typically involves fragmenting the DNA (sonication, nebulization or shearing), followed by DNA repair and end polishing (blunt end or A overhang), and platform-specific adapter ligation. In one embodiment, the methods described here can use next generation sequencing technologies (NGS), which allow multiple samples to be sequenced individually as genomic molecules (i.e., singleplex sequencing) or as pooled samples comprising indexed genomic molecules (e.g., multiplex sequencing) in a single sequencing run. These methods can generate up to several billion DNA sequence readings. In various embodiments, the sequences of genomic nucleic acids and/or of indexed genomic nucleic acids can be determined using, for example, the next generation sequencing technologies (NGS) described here. In various embodiments, analysis of the massive amount of sequence data obtained using NGS can be performed using one or more processors as described here.
[00255] [00255] In several modalities the use of such sequencing technologies does not involve the preparation of sequencing libraries.
[00256] However, in certain embodiments the sequencing methods contemplated here involve the preparation of sequencing libraries. In an illustrative approach, preparing a sequencing library involves producing a random collection of adapter-modified DNA fragments (for example, polynucleotides) that are ready to be sequenced. Polynucleotide sequencing libraries can be prepared from DNA or RNA, including equivalents or analogs of either DNA or cDNA, for example, DNA or cDNA that is complementary or copy DNA produced from an RNA template by the action of reverse transcriptase. Polynucleotides can originate in double-stranded form (for example, dsDNA such as genomic DNA fragments, cDNA, PCR amplification products, and the like) or, in certain embodiments, polynucleotides can originate in single-stranded form (for example, ssDNA, RNA, etc.) and have been converted to dsDNA form. THE
[00257] Preparation of sequencing libraries for some NGS sequencing platforms is facilitated by the use of polynucleotides comprising a specific range of fragment sizes. Preparation of such libraries typically involves the fragmentation of large polynucleotides (e.g., cellular genomic DNA) to obtain polynucleotides in the desired size range.
[00258] [00258] Paired end readings can be used for the sequencing methods and systems described here. The length of the fragment or insert is longer than the reading length, and sometimes longer than the sum of the lengths of the two readings.
[00259] In some illustrative embodiments, the nucleic acid(s) in the sample is/are obtained as genomic DNA, which is subjected to fragmentation into fragments longer than approximately 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, or 5000 base pairs, to which NGS methods can be readily applied. In some embodiments, paired end readings are obtained from inserts of around 100-5000 bp. In some embodiments, the inserts are about 100-1000 bp in length. These are sometimes implemented as regular short-insert paired end readings. In some embodiments, the inserts are about 1000-5000 bp in length. These are sometimes implemented as long-insert mate pair readings as previously described.
[00260] In some implementations, long inserts are designed to evaluate very long sequences. In some implementations, mate pair readings can be applied to obtain readings that are spaced apart by thousands of base pairs. In these implementations, inserts or fragments range from hundreds to thousands of base pairs, with two biotin junction adapters at the two ends of an insert. Then the biotin junction adapters join the two ends of the insert to form a circularized molecule, which is then further fragmented. A subfragment including the biotin junction adapters and the two ends of the original insert is selected for sequencing on a platform that is designed to sequence shorter fragments.
[00261] Fragmentation can be achieved by any of a number of methods known to those skilled in the art. For example, fragmentation can be achieved by mechanical means including, but not limited to, nebulization, sonication and hydroshearing. However, mechanical fragmentation typically cleaves the DNA backbone at C-O, P-O and C-C bonds, resulting in a heterogeneous mixture of blunt and 3'- and 5'-overhanging ends with broken C-O, P-O and/or C-C bonds (see, for example, Alnemri and Litwack, J Biol. Chem 265: 17323-17333 [1990]; Richards and Boyer, J Mol Biol 11: 327-240 [1965]), which may need to be repaired since they may lack the 5'-phosphate required for the subsequent enzymatic reactions, for example, ligation of sequencing adapters, which are required to prepare DNA for sequencing.
[00262] [00262] In contrast, cfDNA typically exists as fragments of less than about 300 base pairs and therefore fragmentation is not typically necessary to generate a sequencing library using cfDNA samples.
[00263] Typically, whether polynucleotides are forcibly fragmented (e.g., fragmented in vitro) or naturally exist as fragments, they are converted to blunt-ended DNA having 5'-phosphates and 3'-hydroxyl. Standard protocols, for example, sequencing protocols using the Illumina platform as described in the previous example workflow with reference to figures 1A and 1B, instruct users to end-repair the sample DNA, to purify the end-repaired products prior to adenylation or dA-tailing of the 3' ends, and to purify the dA-tailing products prior to the adapter ligation steps of the library preparation.
[00264] Various embodiments of the sequencing library preparation methods described here obviate the need to perform one or more of the steps typically required by standard protocols to obtain a modified DNA product that can be sequenced by NGS. An abbreviated method (ABB method), a 1-step method, and a 2-step method are examples of methods for preparing a sequencing library, which can be found in patent application 13/555,037 filed on July 20, 2012, which is incorporated by reference in its entirety.
Sequencing methods
[00265] [00265] The methods and devices described here can employ next generation sequencing technology (NGS), which allows for massively parallel sequencing. In certain modalities, models of
[2009]; Metzker M Nature Rev 11: 31-46 [2010]). NGS sequencing technologies include, but are not limited to, pyrosequencing, sequencing by synthesis with reversible dye terminators, sequencing by oligonucleotide probe ligation, and ion semiconductor sequencing. DNA from individual samples can be sequenced individually (i.e., singleplex sequencing) or DNA from multiple samples can be pooled and sequenced as indexed genomic molecules (i.e., multiplex sequencing) in a single sequencing run, to generate up to several hundred million DNA sequence readings. Examples of sequencing technologies that can be used to obtain the sequence information according to the present method are further described here.
[00266] Some sequencing technologies are commercially available, such as the Affymetrix Inc. sequencing-by-hybridization platform (Sunnyvale, CA), the sequencing-by-synthesis platforms of 454 Life Sciences (Bradford, CT), Illumina/Solexa (Hayward, CA) and Helicos Biosciences (Cambridge, MA), and the Applied Biosystems sequencing-by-ligation platform (Foster City, CA), as described below. In addition to the single molecule sequencing performed using the sequencing by synthesis of Helicos Biosciences, other single molecule sequencing technologies include, but are not limited to, the SMRT™ technology of Pacific Biosciences, the ION TORRENT™ technology, and nanopore sequencing developed, for example, by Oxford Nanopore Technologies.
[00267] [00267] Although the automated Sanger method is considered to be a “first generation” technology, Sanger sequencing including automated Sanger sequencing, can also be employed in the methods described here. Additional suitable sequencing methods include, but are not limited to, nucleic acid imaging technologies, for example, atomic force microscopy (AFM) or transmission electron microscopy (TEM). Illustrative sequencing technologies are described in more detail below.
[00268] In some embodiments, the methods described involve obtaining sequence information for nucleic acids in the test sample by massively parallel sequencing of millions of DNA fragments using Illumina's sequencing by synthesis and reversible terminator-based sequencing chemistry (for example, as described in Bentley et al., Nature 6: 53-59 [2009]). Template DNA can be genomic DNA, for example, cellular DNA or cfDNA. In some embodiments, genomic DNA from isolated cells is used as the template, and is fragmented into lengths of several hundred base pairs. In other embodiments, cfDNA or circulating tumor DNA (ctDNA) is used as the template, and fragmentation is not required as cfDNA or ctDNA exists as short fragments. For example, fetal cfDNA circulates in the bloodstream as fragments approximately 170 base pairs (bp) long (Fan et al., Clin Chem 56: 1279-1286 [2010]), and no DNA fragmentation is required before sequencing. Illumina's sequencing technology relies on the attachment of fragmented genomic DNA to a planar, optically transparent surface to which oligonucleotide anchors are bound. Template DNA is end-repaired to generate 5'-phosphorylated blunt ends, and the polymerase activity of the Klenow fragment is used to add a single A base to the 3' end of the blunt phosphorylated DNA fragments. This addition prepares the DNA fragments for ligation to oligonucleotide adapters, which have a single T base overhang at their 3' end to increase
[00269] Several embodiments of the description can use sequencing by synthesis that allows paired end sequencing. In some embodiments, the Illumina sequencing by synthesis platform involves clustering fragments. Clustering is a process in which each fragment molecule is amplified isothermally. In some embodiments, such as the example described here, the fragment has two different adapters attached to the two ends of the fragment, the adapters allowing the fragment to hybridize to the two different oligos on the surface of a flow cell lane. The fragment additionally includes or is connected to two index sequences at the two ends of the fragment, index sequences which provide labels for identifying different samples in multiplex sequencing. On some sequencing platforms, a fragment to be sequenced from both ends is also referred to as an insert.
[00270] In some implementations, a flow cell for clustering on the Illumina platform is a glass slide with lanes. Each lane is a glass channel coated with a lawn of two types of oligos (for example, P5 and P7' oligos). Hybridization is made possible by the first of the two types of oligos on the surface. This oligo is complementary to a first adapter at one end of the fragment. A polymerase creates a complementary strand of the hybridized fragment. The double-stranded molecule is denatured, and the original template strand is washed away. The remaining strand, in parallel with many other remaining strands, is clonally amplified by bridge amplification.
[00271] In bridge amplification and other sequencing methods involving clustering, a strand folds over, and a second adapter region at a second end of the strand hybridizes to the second type of oligos on the flow cell surface. A polymerase generates a complementary strand, forming a double-stranded bridge molecule. This double-stranded molecule is denatured, resulting in two single-stranded molecules tethered to the flow cell through two different oligos. The process is then repeated several times, and occurs simultaneously for
[00272] After clustering, the sequencing starts by extending a first sequencing primer to generate the first reading. With each cycle, fluorescently labeled nucleotides compete for addition to the growing chain. Only one is incorporated, based on the template sequence. After the addition of each nucleotide, the cluster is excited by a light source, and a characteristic fluorescent signal is emitted. The number of cycles determines the length of the reading. The emission wavelength and signal intensity determine the base call. For a given cluster, all identical strands are read simultaneously. Hundreds of millions of clusters are sequenced in a massively parallel manner. At the conclusion of the first reading, the reading product is washed away.
[00273] In the next step of protocols involving two index primers, an index 1 primer is introduced and hybridized to an index 1 region on the template. Index regions provide fragment identification, which is useful for demultiplexing samples in a multiplex sequencing process. The index 1 reading is generated similarly to the first reading. Upon completion of the index 1 reading, the reading product is washed away and the 3' end of the strand is deprotected. The template strand then folds over and binds to a second oligo on the flow cell. An index 2 sequence is read in the same way as index 1. Then the index 2 reading product is washed away at the completion of the step.
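How index readings support demultiplexing can be sketched as a lookup that tolerates a small number of mismatches. The sample names, index sequences, function names, and the one-mismatch default below are hypothetical; this is a simplified single-index illustration, not the demultiplexing procedure of any particular platform.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_reading, sample_indexes, max_mismatches=1):
    """Return the sample whose index matches the reading, or None.

    sample_indexes: {sample_name: index_sequence} (hypothetical sample sheet).
    A reading is assigned only if exactly one sample index lies within
    max_mismatches of it; ambiguous or unmatched readings return None.
    """
    matches = [
        name
        for name, index in sample_indexes.items()
        if len(index) == len(index_reading)
        and hamming_distance(index, index_reading) <= max_mismatches
    ]
    return matches[0] if len(matches) == 1 else None


sample_sheet = {"sample_A": "ACGTAC", "sample_B": "TGCATG"}
```

With this sample sheet, an index reading of "ACGTAA" is one mismatch from the sample_A index and is assigned to it, while a reading far from both indexes is left unassigned.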
[00274] After reading the two indexes, reading 2 starts by using polymerases to extend the second flow cell oligos, forming a double-stranded bridge. This double-stranded DNA is denatured, and the 3' end is blocked. The original forward strand is cleaved and removed by
[00275] The sequencing by synthesis example described earlier involves paired end readings, which are used in many of the embodiments of the described methods. Paired end sequencing involves 2 readings from the two ends of a fragment. Paired end readings allow users to choose the length of the insert (or fragment to be sequenced) and to sequence either end of the insert, generating high-quality, alignable sequencing data. Because the distance between each paired reading is known, alignment algorithms can use this information to map readings in repetitive regions more precisely. This results in better alignment of readings, especially across repetitive regions of the genome that are difficult to sequence. Paired end sequencing can detect rearrangements, including insertions and deletions (indels) and inversions.
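The way a known pairing distance helps place a reading in a repetitive region can be illustrated with a simple filter. This is a sketch under simplifying assumptions: the function name and parameters are hypothetical, strand orientation is ignored, and the "insert distance" is approximated as the distance between the start positions of the two readings.

```python
def consistent_mate_positions(anchor_start, candidate_starts,
                              expected_distance, tolerance):
    """Keep candidate positions compatible with the expected pairing distance.

    anchor_start: start position of the uniquely aligned reading of the pair.
    candidate_starts: possible start positions of the mate (for example,
    several copies of a repeat).
    expected_distance, tolerance: assumed insert distance and allowed deviation.
    """
    return [
        start
        for start in candidate_starts
        if abs(abs(start - anchor_start) - expected_distance) <= tolerance
    ]


# The mate aligns equally well to three repeat copies; only one lies at a
# distance compatible with an insert of about 300 bp from the anchor.
placements = consistent_mate_positions(1000, [1290, 5300, 90210], 300, 50)
```

Only the candidate near the expected distance survives the filter, which is the intuition behind using pairing information to resolve repetitive alignments.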
[00276] Paired readings can use inserts of different lengths (i.e., different sizes of the fragment to be sequenced). As the default meaning in this description, paired end readings are used to refer to readings obtained from various insert lengths. In some cases, to distinguish paired end readings from
[00277] After sequencing the DNA fragments, sequence readings of predetermined length, for example, 100 bp, are localized by mapping (alignment) to a known reference genome. The mapped readings and their corresponding locations in the reference sequence are also referred to as markers. In another embodiment of the procedure, localization is performed by k-mer sharing and reading-by-reading alignment. The analyses of many embodiments described here make use of readings that are poorly aligned or cannot be aligned, as well as aligned readings (markers). In one embodiment, the reference genome sequence is the NCBI36/hg18 sequence, which is available on the World Wide Web at genome.ucsc.edu/cgi-bin/hgGateway?org=Human&db=hg18&hgsid=166260105. Alternatively, the reference genome sequence is GRCh37/hg19 or GRCh38, which is available on the World Wide
[00278] [00278] Other sequencing methods can also be used to obtain sequence readings and alignments. Additional suitable methods are described in U.S. Patent Application No. 15 / 130,668 filed on April 15, 2016, which is incorporated by reference in its entirety.
[00279] In some embodiments of the methods described here, the sequence readings are about 20bp, about 25bp, about 30bp, about 35bp, about 40bp, about 45bp, about 50bp, about 55bp, about 60bp, about 65bp, about 70bp, about 75bp, about 80bp, about 85bp, about 90bp, about 95bp, about 100bp, about 110bp, about 120bp, about 130bp, about 140bp, about 150bp, about 200bp, about 250bp, about 300bp, about 350bp, about 400bp, about 450bp, or about 500bp. It is expected that technological advances will enable single end readings of more than 500bp, enabling readings of more than about 1000bp when paired end readings are generated. In some modalities, readings of
[00280] [00280] A plurality of sequence markers (i.e. readings aligned to a reference sequence) are typically obtained per sample. In some embodiments, at least about 3 x 10 th sequence markers, at least about 5 x 10 th sequence markers, at least about 8 x 10 th sequence markers, at least about 10 x 10 th sequence markers, at least about 15 x 10 th sequence markers, at least about 20 x 10 th sequence markers, at least about 30 x 10 th sequence markers, at least about 40 x
[00281] [00281] As should be apparent, certain modalities of the invention employ processes acting under control of instructions and / or data stored in or transferred through one or more computer systems. Certain modalities also relate to a device to perform these operations. Such apparatus may be specially designed and / or constructed for the required purpose, or may be a general purpose computer selectively configured by one or more computer programs and / or data structures stored on or otherwise made available to the computer. In particular, several general purpose machines can be used with programs written in accordance with the teachings here, or it may be more convenient to build a more specialized apparatus to perform the required method steps. A particular structure for a variety of these machines is shown and described below.
[00282] [00282] Certain modalities also provide functionality (for example, code and processes) to store any of the results (for example, query results) or data structures generated as described here. Such results or data structures are typically stored, at least temporarily, in a computer-readable medium. Results or data structures can also be made available in any of several ways such as display, printing and the like.
[00283] Examples of tangible computer-readable media suitable for use in the computer program products and computer devices of this invention include, but are not limited to, magnetic media such as hard drives, floppy disks, and magnetic tape; optical media such as CD-ROM discs; magneto-optical media; semiconductor memory devices (for example, flash memory); and hardware devices that are specially configured to store and execute program instructions, such as read-only memory (ROM) and random access memory (RAM) devices and sometimes application-specific integrated circuits (ASICs), programmable logic devices (PLDs), and signal transmission media for delivering computer-readable instructions, such as local area networks, wide area networks, and the Internet. The data and program instructions provided here can also be incorporated into a carrier wave or other means of transport (including electronically or optically conducting pathways). The data and program instructions of this invention can also be incorporated into a carrier wave or other means of transport (for example, optical lines, power lines, and/or air waves).
[00284] [00284] Examples of program instructions include low-level code, such as that produced by a compiler, as well as higher-level code that can be executed by the computer using an interpreter. In addition, the program instructions can be machine code, source code and / or any other code that directly or indirectly controls the operation of a computing machine. The code can specify input, output, calculations, conditionals, branches, iterative loops, etc.
[00285] Analysis of the sequencing data and the diagnosis derived from it are typically performed using various computer-executed algorithms and programs. Therefore, certain modalities employ processes involving data stored on or transferred through one or more computer systems or other processing systems. The modalities described here also relate to devices to perform these operations. These devices can be specially built for the required purpose, or they can be a general purpose computer (or a group of computers) selectively activated or reconfigured by a computer program and/or data structure stored on the computer. In some modalities, a group of processors performs some or all of the cited analytical operations collaboratively (for example, through a network or cloud computing) and/or in parallel. A processor or group of processors to perform the methods described here can be of various types including microcontrollers and microprocessors such as programmable devices (for example, CPLDs and FPGAs) and non-programmable devices such as gate array ASICs or general purpose microprocessors.
[00286] One implementation provides a system for use in determining a low allele frequency sequence in a test sample including nucleic acids, the system including a sequencer for receiving a nucleic acid sample and providing nucleic acid sequence information from the sample; a processor; and a machine-readable storage medium having stored thereon instructions for execution on said processor to determine a sequence of interest in the test sample by: (a) applying adapters to DNA fragments in the sample to obtain DNA adapter products, where each adapter comprises a unique non-random molecular index, and where the unique non-random molecular indexes of the adapters have at least two different molecular lengths and form a set of unique non-random molecular indexes of varying length (vNRUMIs); (b) amplifying the DNA adapter products to obtain a plurality of amplified polynucleotides; (c) sequencing, using the sequencer, the plurality
[00287] In some modalities of any of the systems provided here, the sequencer is configured to perform next generation sequencing (NGS). In some embodiments, the sequencer is configured to perform massively parallel sequencing using sequencing by synthesis with reversible dye terminators. In other embodiments, the sequencer is configured to perform sequencing by ligation. In still other embodiments, the sequencer is configured to perform single molecule sequencing.
[00288] Another implementation provides a system including a nucleic acid synthesizer, a processor, and a machine-readable storage medium having stored thereon instructions for execution on said processor to prepare sequencing adapters. The instructions include: (a) providing to the processor a set of oligonucleotide sequences having at least two different molecular lengths; (b) selecting, by the processor, a subset of oligonucleotide sequences from the set of oligonucleotide sequences, all edit distances between oligonucleotide sequences of the subset meeting a threshold value, the subset of oligonucleotide sequences forming a set of unique non-random molecular indices of variable length (vNRUMIs); and (c) synthesizing, using the nucleic acid synthesizer, a plurality of sequencing adapters, wherein each sequencing adapter comprises a hybridized double-stranded region, an arm
[00289] In addition, certain modalities relate to tangible and/or non-transitory computer-readable media or computer program products that include program instructions and/or data (including data structures) to perform various computer-implemented operations. Examples of computer-readable media include, but are not limited to, semiconductor memory devices, magnetic media such as disk drives, magnetic tape, optical media such as CDs, magneto-optical media, and hardware devices that are specially configured to store and execute program instructions, such as read-only memory (ROM) and random access memory (RAM) devices. Computer-readable media can be directly controlled by an end user or the media can be indirectly controlled by an end user. Examples of directly controlled media include media located at a user facility and/or media that are not shared with other entities. Examples of indirectly controlled media include media that are indirectly accessible to the user through an external network and/or through a service providing shared resources such as the "cloud." Examples of program instructions include both machine code, as produced by a compiler, and files containing higher-level code that can be executed by the computer using an interpreter.
[00290] In various modalities, the data or information used in the methods and devices described are provided in an electronic format. Such data or information may include readings and markers derived from a nucleic acid sample, reference sequences (including reference sequences providing solely or primarily polymorphisms), calls such as cancer diagnosis calls,
[00291] One modality provides a computer program product to generate an output that indicates the sequence of a DNA fragment of interest in a test sample. The computer product may contain instructions for performing any one or more of the methods described above to determine a sequence of interest. As explained, the computer product may include a non-transitory and/or tangible computer-readable medium having computer-executable or compilable logic (for example, instructions) recorded on it to enable a processor to determine a sequence of interest. In one example, the computer product comprises a computer-readable medium having computer-executable or compilable logic (for example, instructions) recorded on it to enable a processor to diagnose a condition or determine a nucleic acid sequence of interest.
[00292] [00292] It must be understood that it is not practical, or even possible in most cases, for a human being without assistance to perform the computational operations of the methods described here. For example, mapping a single 30 bp reading from a sample to any of the human chromosomes can take years of effort without the assistance of a computer device. Naturally, the problem is compounded since reliable calls of low-frequency allele mutations often require mapping thousands (for example, at least about 10,000) or even
[00293] The methods described here can be performed using a system to determine a sequence of interest in a test sample. The system may include: (a) a sequencer for receiving nucleic acids from the test sample and providing nucleic acid sequence information from the sample; (b) a processor; and (c) one or more computer-readable storage media having stored thereon instructions for execution on said processor to determine a sequence of interest in the test sample. In some embodiments, the methods are instructed by a computer-readable medium having stored thereon computer-readable instructions for carrying out a method to determine the sequence of interest. Thus a modality provides a computer program product including a non-transitory, machine-readable medium that stores program code that, when executed by one or more processors in a computer system, causes the computer system to implement a method for determining the sequences of nucleic acid fragments in a test sample. The program code may include: (a) code to obtain a plurality of readings from a plurality of amplified polynucleotides, each polynucleotide of the plurality of amplified polynucleotides comprising an adapter attached to a DNA fragment, where the adapter comprises a unique non-random molecular index, and where the unique non-random molecular indices of the adapters have at least two different molecular lengths, forming a set of unique non-random molecular indices of variable length (vNRUMIs); (b) code to identify, among the plurality of readings, readings associated with the same vNRUMI; and (c) code to determine, using the readings associated with the same vNRUMI, a sequence of a DNA fragment in the sample.
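As an illustration only, steps (b) and (c) of the program code can be sketched as grouping readings by their assigned vNRUMI and taking a per-position majority vote within each group. This is a minimal sketch under stated assumptions, not the collapsing procedure of the described modalities: the function name `collapse_by_umi`, the toy sequences, and the majority-vote rule are all hypothetical, and the UMI is assumed to have been assigned to each reading already.

```python
from collections import Counter, defaultdict


def collapse_by_umi(readings):
    """Group (umi, sequence) pairs by UMI and build a majority-vote consensus.

    Assumption: every reading already carries its assigned vNRUMI.
    """
    groups = defaultdict(list)
    for umi, seq in readings:
        groups[umi].append(seq)

    consensus = {}
    for umi, seqs in groups.items():
        # Compare only the positions shared by all readings in the group.
        length = min(len(s) for s in seqs)
        consensus[umi] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus


# Toy data (hypothetical): three readings share one UMI, one has an error.
readings = [
    ("ACGTAC", "TTGACCA"),
    ("ACGTAC", "TTGACCA"),
    ("ACGTAC", "TTGTCCA"),
    ("GGATCCA", "CATTAGG"),
]
result = collapse_by_umi(readings)
```

In this toy input the single discordant base in the third reading is outvoted, so the group tagged `ACGTAC` collapses to one sequence. A practical implementation would typically also use alignment position and the UMIs at both ends of the fragment.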
[00294] [00294] In some modalities, the program codes or the
[00295] [00295] The described methods can also be performed using a computer processing system that is adapted or configured to perform a method to determine a sequence of interest. One modality provides a computer processing system that is adapted or configured to execute a method as described here. In one embodiment, the apparatus includes a sequencing device adapted or configured for sequencing at least a portion of the nucleic acid molecules in a sample to obtain the type of sequence information described elsewhere here. The apparatus may also include components for processing the sample. Such components are described elsewhere here.
[00296] Sequence or other data can be input into a computer or stored on a computer-readable medium, either directly or indirectly. In one embodiment, a computer system is directly coupled to a sequencing device that reads and/or analyzes sample nucleic acid sequences. Sequences or other information from such tools are provided through an interface on the computer system. Alternatively, the sequences processed by the system are provided from a sequence storage source such as a database or other repository. Once available for the
[00297] [00297] In one example, a user provides a sample on a sequencing device. Data is collected and / or analyzed by the sequencing device that is connected to a computer. Computer software allows data collection and / or analysis. Data can be stored, displayed (via a monitor or similar device), and / or sent to another location. The computer can be connected to the internet which is used to transmit data to a portable device used by a remote user (for example, a doctor, scientist or analyst). It is understood that data can be stored and / or analyzed before transmission. In some modalities, raw data is collected and sent to a remote user or device that will analyze and / or store the data. Transmission can take place via the internet, but it can also take place via satellite or another connection. Alternatively, data can be stored on a computer-readable medium and the medium can be forwarded to an end user (for example, by mail). The remote user can be in the same geographic location or a different location including, but not limited to, a building, city, state, country or continent.
[00298] [00298] In some embodiments, the methods also include collecting data referring to a plurality of polynucleotide sequences (for example, readings, markers and / or reference chromosome sequences)
[00299] Among the types of electronically formatted data that can be stored, transmitted, analyzed, and/or manipulated in the systems, devices, and methods described here are the following:
readings obtained by sequencing nucleic acids in a test sample
markers obtained by aligning readings to a reference genome or other reference sequence or sequences
the reference genome or sequence
thresholds for calling a test sample as affected, unaffected, or no call
the actual calls of medical conditions related to the sequence of interest
diagnoses (clinical condition associated with the calls)
recommendations for additional testing derived from the calls and/or diagnoses
treatment and/or monitoring plans derived from the calls and/or diagnoses
[00300] These various types of data can be obtained, stored, transmitted, analyzed, and/or manipulated in one or more locations using different devices. Processing options cover a wide spectrum. At one end of the spectrum, all or much of this information is stored and used where the test sample is processed, for example, a clinic or other clinical setting. At the other extreme, the sample is obtained in one location, it is processed and optionally sequenced in a different location, readings are aligned and calls are made in one or more different locations, and diagnoses, recommendations, and/or plans are prepared in yet another location (which can be a location where the sample was taken).
[00301] In various modalities, the readings are generated with the sequencing apparatus and then transmitted to a remote site where they are processed to determine a sequence of interest. At that remote location, as an example, the readings are aligned to a reference sequence to produce anchor and anchored readings. Among the processing operations that can be employed in different locations are the following:
sample collection
sample processing preliminary to sequencing
sequencing
analysis of sequence data and derivation of medical calls
diagnosis
reporting of a diagnosis and/or a call to a patient or health care provider
development of a plan for additional treatment, testing, and/or monitoring
implementation of the plan
counseling
[00302] Any one or more of these operations can be automated as described elsewhere here. Typically, the sequencing, the analysis of sequence data, and the derivation of medical calls will be performed computationally. The other operations can be carried out manually or automatically.
[00303] [00303] Figure 6 shows an implementation of a dispersed system for producing a call or diagnosis of a test sample. A sample collection site 01 is used to obtain a test sample from a patient. The samples are then provided to a processing and sequencing site 03 where the test sample can be processed and sequenced as previously described. Site 03 includes apparatus for processing the sample as well as apparatus for sequencing the processed sample. The result of the sequencing, as described elsewhere here, is a collection of readings that are typically provided in an electronic format and provided to a network such as the internet, which is indicated by reference number 05 in figure 6.
[00304] [00304] The sequence data is provided to a remote location 07 where analysis and generation of calls are carried out. This location can include one or more powerful computing devices such as computers or processors. After the computational resources at site 07 have completed their analysis and generated a call from the received sequence information, the call is relayed to network 05. In some implementations, not only is a call generated at site 07 but an associated diagnosis is also generated . The call and / or diagnosis is / are then transmitted over the network and back to the sample collection site 01 as illustrated in figure 6. As explained, this is simply one of many variations of how the various operations associated with generating a call or diagnosis can be divided between several locations. A common variant involves providing sample collection and processing and sequencing in one location. Another variation involves providing processing and sequencing in the same location as analysis and generation of calls.
[00305] Figure 7 illustrates, in simple block format, a typical computer system that, when properly configured or designed, can serve as a computational device according to certain modalities. The computer system 2000 includes any number of processors 2002 (also referred to as central processing units, or CPUs) that are coupled to storage devices including primary storage 2006 (typically a random access memory, or RAM) and primary storage 2004 (typically a read-only memory, or ROM). CPU 2002 can be of various types including microcontrollers and microprocessors such as programmable devices (for example, CPLDs and FPGAs) and non-programmable devices such as gate array ASICs or general purpose microprocessors. In the depicted modality, primary storage 2004 acts to transfer data and instructions unidirectionally to the CPU, and primary storage 2006 is typically used to transfer data and instructions in a bidirectional manner. Both of these primary storage devices can include any computer-readable medium such as those described above. A mass storage device 2008 is also bidirectionally coupled to primary storage 2006, provides additional data storage capacity, and can include any of the computer-readable media described above. The mass storage device 2008 can be used to store programs, data and the like and is typically a secondary storage medium such as a hard drive. Often, such programs, data and the like are temporarily copied to primary memory 2006 for execution on CPU 2002. It will be appreciated that information retained within the mass storage device 2008 can, in appropriate cases, be incorporated in a standard manner as part of primary storage
[00306] CPU 2002 is also coupled to an interface 2010 that connects to one or more input/output devices such as a nucleic acid sequencer (2020), a nucleic acid synthesizer (2022), video monitors, trackballs, mice, keyboards, microphones, touch screens, transducer card readers, magnetic or paper tape readers, tablets, styluses, speech or handwriting recognition peripherals, USB ports, or other well-known input devices such as, of course, other computers. Finally, CPU 2002 can optionally be coupled to an external device such as a database or a computer or telecommunications network using an external connection as shown generally at 2012. With such a connection, it is contemplated that the CPU can receive information from the network, or can output information to the network, in the course of performing the method steps described here. In some implementations, a nucleic acid sequencer or a nucleic acid synthesizer can be communicatively linked to CPU 2002 via the network connection 2012 instead of or in addition to via the interface 2010.
[00307] In one embodiment, a system such as computer system 2000 is used as a data import, data correlation, and query system capable of performing some or all of the tasks described here. Information and programs, including data files, can be provided through a network connection 2012 for access or download by a researcher. Alternatively, such information, programs and files can be provided to the researcher on a storage device.
[00308] In a specific embodiment, the computer system 2000 is directly coupled to a data acquisition system such as a microarray or high-throughput screening system, or a nucleic acid sequencer (2020) that captures sample data. Data from such systems are provided through the interface 2010 for analysis by the system 2000. Alternatively, the data processed by the system 2000 is provided from a data storage source such as a database or other relevant data repository. Once in the apparatus 2000, a memory device such as primary storage 2006 or mass storage 2008 buffers or stores, at least temporarily, relevant data. The memory can also store various routines and/or programs for importing, analyzing and displaying data, including sequence readings, UMIs, codes for determining sequence readings, collapsing sequence readings and correcting errors in readings, etc.
[00309] [00309] In certain embodiments, the computers used here may include a user terminal, which can be any type of computer (for example, desktop, laptop, tablet, etc.), media computing platforms (for example, cable, satellite decoders, digital video recorders, etc.), portable computing devices (for example, PDAs, email clients, etc.), cell phones or any other type of computing or communication platforms.
[00310] In certain embodiments, the computers used here may also include a server system in communication with a user terminal, which server system may include a server device or decentralized server devices, and may include mainframe computers, minicomputers, supercomputers, personal computers, or combinations thereof. A plurality of server systems can also be used without departing from the scope of the present invention. A user terminal and a server system can communicate with each other over a network. The network can comprise, for example, wired networks such as LANs (local area networks), WANs (wide area networks), MANs (metropolitan area networks), ISDNs (Integrated Services Digital Networks), etc., as well as wireless networks such as wireless LANs, CDMA, Bluetooth, and satellite communication networks, etc., without limiting the scope of the present invention.
[00311] Table 1 shows the base pair heterogeneity of NRUMIs in comparison with the base pair heterogeneity of vNRUMIs according to some implementations. This set of 120 vNRUMIs is comprised of 50 hexamers and 70 heptamers. The NRUMI set is comprised entirely of 218 hexamers, where the minimum edit distance between any two NRUMIs meets a threshold value. Table 1 assumes that each of the 218 or 120 barcodes was present in equal quantities, for example, there are 1000 of each UMI. For the 7th base, the new vNRUMI set has much better heterogeneity than the original NRUMI set, and far exceeds the recommended minimum of 5% composition per base. Thus, it is clear that the vNRUMI design addresses the aforementioned challenge of lacking base pair diversity in certain cycles. Other barcode sets comprised exclusively of hexamers have per-base heterogeneity similar to that of the original NRUMI set shown below.
Table 1: Heterogeneity of base pairs among UMI positions
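The kind of per-cycle heterogeneity that Table 1 reports can be illustrated with a short sketch. It assumes, hypothetically, that once a short UMI ends the sequencer reads an invariant downstream adapter base (set to "T" here purely for illustration); the function name `per_cycle_composition` and the toy UMI sets are not from the source.

```python
from collections import Counter


def per_cycle_composition(umis, downstream="T", cycles=None):
    """Fraction of A/C/G/T observed at each sequencing cycle over a UMI set.

    downstream: hypothetical invariant adapter bases read after a UMI ends;
    it must be long enough to pad the shortest UMI up to `cycles`.
    """
    if cycles is None:
        cycles = max(len(u) for u in umis)
    table = []
    for pos in range(cycles):
        # Bases past the end of a short UMI come from the downstream sequence.
        counts = Counter((u + downstream)[pos] for u in umis)
        table.append({b: counts[b] / len(umis) for b in "ACGT"})
    return table


# Toy sets (hypothetical), each UMI present in equal quantity.
mixed = per_cycle_composition(["ACGTAC", "GGATCCA", "TTCAGGC"])
hexamers_only = per_cycle_composition(["ACGTAC", "GGATCC"], cycles=7)
```

With the all-hexamer toy set, the 7th cycle reads the same downstream base in every molecule, whereas the mixed-length set shows several bases at that cycle. This is the effect the variable-length design is meant to produce.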
[00312] Using the previous NRUMIs and vNRUMIs, in silico studies were performed that simulated 10,000 barcodes, mutated each barcode by mutating each base independently, and attempted to recover the original UMI sequence. The simulation used a mutation rate of 2% at each base (1% chance of an SNV, 1% chance of a size-1 indel). Note that this mutation rate is appreciably higher than typical Illumina sequencing error rates. Each of the 10,000 simulated barcodes contained at least one mutation.
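A mutation step of this kind might be sketched as follows. The per-base rates are the ones stated above; everything else is an assumption made for illustration: the even split between insertion and deletion, the placement of an inserted base after the current base, and resampling until the barcode actually differs from the original.

```python
import random

BASES = "ACGT"


def mutate(umi, rng, p_snv=0.01, p_indel=0.01):
    """Mutate each base independently: SNV, size-1 indel, or unchanged."""
    out = []
    for base in umi:
        r = rng.random()
        if r < p_snv:
            # Substitution by one of the three other bases.
            out.append(rng.choice([b for b in BASES if b != base]))
        elif r < p_snv + p_indel:
            if rng.random() < 0.5:
                continue  # deletion (assumed half of all indels)
            out.append(base)
            out.append(rng.choice(BASES))  # insertion after this base
        else:
            out.append(base)
    return "".join(out)


def mutate_until_changed(umi, rng, **rates):
    """Resample until the string differs, so every barcode is mutated."""
    while True:
        mutated = mutate(umi, rng, **rates)
        if mutated != umi:
            return mutated


rng = random.Random(0)  # fixed seed for reproducibility
mutated = mutate_until_changed("ACGTAC", rng)
```

Because most draws at a 2% per-base rate leave a 6 or 7 nt barcode unchanged, the resampling loop usually runs several times before returning.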
[00313] To provide additional comparison to other methods using UMIs, a set of 114 NRUMI sequences of 6 nt length generated according to an existing approach, nxCode, is also used in this simulation study. See http://hannonlab.cshl.edu/nxCode/nxCode/main.html. These sequences were subjected to the same mutation processes as previously described. The nxCode approach uses a probabilistic model to determine mutations, and uses a semi-greedy approach to obtain a set of NRUMIs having equal molecular length. The results of the comparison between the vNRUMI, NRUMI, and nxCode sets can be found in Table 2.
Table 2: Comparative results of error correction rates for different UMI designs
Metric                                        vNRUMI   NRUMI    nxCode
Mutated UMIs simulated                        10,000   10,000   10,000
Uniquely correctable                          7,703    2,447    3,829
Among the closest matches                     9,242    9,779    9,629
Average size of the closest set               1.2138   3.0261   2.0978
Among the closest or second closest matches   9,927    9,865    9,897
Average size of the second closest set        3.939    7.781    6.0504
[00314] The set of vNRUMIs has 120 UMIs, of which 50 UMIs are 6 nt long and 70 UMIs are 7 nt long. The set of NRUMIs has 218 sequences of length 6 nt. A conventional nxCode approach uses a set of 114 NRUMI sequences of length 6 nt. The average size of a set is the average number of unique sequences included in a set.
[00315] In Table 2, a unique correction is defined as a case where the set of closest neighbors has only one sequence in it; in other words, the UMI matching and correction algorithm described earlier gave an unambiguous suggestion for the most likely true vNRUMI. Note that the number of such uniquely correctable sequences is much higher for the vNRUMI methodology than for NRUMI and nxCode. Also, the average size of the closest set and of the second closest set is much smaller in the vNRUMI approach than in the other solutions, while the rate at which the original unmutated barcode is contained among those sets is approximately equal. This is important since during read collapsing, contextual information is used to select a correct UMI from these closest and second closest sets. Providing this read collapsing step with fewer incorrect sequences can decrease the chance of making a wrong choice, ultimately improving the ability to suppress noise and detect variants.
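The notions of a closest set and of a unique correction can be made concrete with a small sketch. Note the substitution: it ranks candidates by plain Levenshtein edit distance, used here only as a stand-in for the alignment-score matching of the described modalities. The function names and the toy UMI sets are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def nearest_set(observed, umi_set):
    """All UMIs in the set tied for the smallest distance to `observed`."""
    dists = {u: edit_distance(observed, u) for u in umi_set}
    best = min(dists.values())
    return sorted(u for u, d in dists.items() if d == best)


def is_uniquely_correctable(observed, true_umi, umi_set):
    """True when the closest set contains only the original UMI."""
    return nearest_set(observed, umi_set) == [true_umi]


unique = nearest_set("ACGAAC", ["ACGTAC", "GGATCCA", "TTCAGG"])
ambiguous = nearest_set("AAAC", ["AAAA", "AAAT"])
```

In the second toy call two candidates tie, which is the situation where read collapsing must fall back on contextual information to pick the correct UMI.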
[00316] It is interesting to note that the NRUMI and nxCode approaches, like other previous barcode strategies, assume that the barcode sequences are all of uniform length. In producing this simulation, to provide direct comparisons between the three approaches, the original error correction methods described by the NRUMI and nxCode approaches were not used, which may have limited the performance of the NRUMI and nxCode approaches. However, the data in Table 2 provide an insight into the potential capacity of the vNRUMI approach to improve error correction, which is further illustrated in the following example.
Example 2
Recovery of DNA fragments using vNRUMIs and NRUMIs
[00317] In another set of in silico studies, the capabilities of vNRUMIs and NRUMIs to recover readings are tested. The studies take a random COSMIC mutation and generate a single fragment of DNA containing that mutation. The fragment size has a mean of 166 and a standard deviation of 40. The simulation adds a random UMI to both ends of that fragment. ART (see, for example, https://www.niehs.nih.gov/research/resources/software/biostatistics/art/) is used to simulate 10 paired end readings of that UMI-fragment-UMI molecule, and those readings are aligned using the Burrows-Wheeler Aligner (BWA). See, for example, http://bio-bwa.sourceforge.net/.
[00318] The process then passes the alignment to a proprietary read collapser, ReCo, to determine whether it can recover the original fragment sequence, and repeats the process for further readings.
[00319] Table 3 shows the numbers and percentages of fragments that could be recovered.
Table 3: Error correction rates for NRUMI and vNRUMI designs
[00320] The vNRUMI method recovered more fragments than the fixed-length NRUMI method. A chi-square test shows that the difference is significant: χ² = 4.297, two-tailed P-value = 0.0382. Using α = 0.05, the vNRUMI method achieved statistically better error correction performance compared to the NRUMI method,
[00321] The vNRUMI strategy handles NRUMI sets of heterogeneous length. This addresses the issue of base pair diversity that caused a drop in alignment quality.
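The reported chi-square statistic and P-value can be cross-checked with a short calculation. The sketch assumes one degree of freedom, as for a 2 x 2 table of method against recovered or not recovered; the source does not state the degrees of freedom, and the function name is hypothetical. For one degree of freedom the upper-tail probability of a chi-square statistic x equals erfc(sqrt(x / 2)).

```python
import math


def chi2_pvalue_df1(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


p_value = chi2_pvalue_df1(4.297)
significant = p_value < 0.05
print(f"p = {p_value:.4f}, significant at alpha = 0.05: {significant}")
```

Under that assumption the computed value should come out close to the reported P-value of 0.0382, below the 0.05 threshold.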
[00322] [00322] New processes are provided to generate sets of UMIs of variable length that satisfy biochemical restrictions, and to map UMIs read incorrectly to correct UMIs. The new approach addresses the issue of decreased sequencing quality caused by uniform length barcodes. The use of a matching scheme that recognizes the number of matches and mismatches, as opposed to just tracking mismatches, allows improving the error correction capability. The implementations are comparable to or exceed existing solutions, while providing additional functionality.
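A matching scheme of this kind, which rewards matches as well as penalizing mismatches and does not penalize whatever follows the end of the aligned prefix, might be sketched as below. It follows the prefix-based scoring recited in the claims: the UMI is scored against every prefix of the reading subsequence, the subsequence against every prefix of the UMI, and the highest score is kept. The score values (+1 match, -1 mismatch, -1 gap) and the function name are assumptions made for illustration.

```python
def prefix_alignment_score(umi, subseq, match=1, mismatch=-1, gap=-1):
    """Best global-alignment score of one full string against any prefix
    of the other (score values are illustrative assumptions)."""
    rows, cols = len(umi) + 1, len(subseq) + 1
    # score[i][j]: best global alignment score of umi[:i] vs subseq[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = umi[i - 1] == subseq[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    full_umi = max(score[-1])                   # whole UMI vs each prefix of subseq
    full_subseq = max(r[-1] for r in score)     # whole subseq vs each prefix of UMI
    return max(full_umi, full_subseq)


hexamer_hit = prefix_alignment_score("ACGTAC", "ACGTACT")
heptamer_hit = prefix_alignment_score("ACGTACT", "ACGTACT")
early_mismatch = prefix_alignment_score("ACGTAC", "TCGTACT")
```

In the toy calls, the extra trailing base after a 6 nt UMI costs nothing, while a mismatch at the first position lowers the score. Because matches add to the score, a perfectly matching heptamer outranks a perfectly matching hexamer on the same subsequence.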
[00323] [00323] The present description can be incorporated in other specific ways without departing from its spirit or essential characteristics. The described modalities should be considered in all aspects only as illustrative and not restrictive. The scope of the description is therefore indicated by the appended claims instead of the previous description. All changes arising within the meaning and equivalence range of the claims must be within its scope.

Claims (46)
[1]
1. Method for sequencing nucleic acid molecules in a sample, characterized by the fact that it comprises (a) applying adapters to DNA fragments in the sample to obtain DNA adapter products, where each adapter comprises a unique non-random molecular index, and wherein unique non-random molecular indices of the adapters have at least two different molecular lengths and form a set of unique non-random molecular indices of varying length (v «NRUMIs); (b) amplifying the DNA adapter products to obtain a plurality of amplified polynucleotides; (c) sequencing the plurality of amplified polynucleotides, thus obtaining a plurality of readings associated with the set of VNRUMIs; (d) identify, among the plurality of readings, readings associated with the same single non-random molecular index of variable length (v «NRUMI); and (e) determining a sequence of a DNA fragment in the sample using the readings associated with the same vNRUMI.
[2]
2. Method according to claim 1, characterized by the fact that the identification of the readings associated with the same vNRUMI comprises obtaining, for each reading of the plurality of readings, alignment scores in relation to the set of vNRUMIs, each alignment score indicating similarity between a subsequence of a reading and a vNRUMI, where the subsequence is in a region of the reading in which nucleotides derived from the vNRUMI are likely to be located.
[3]
3. Method according to claim 2, characterized by the fact that the alignment scores are based on nucleotide matches and nucleotide edits between the subsequence of the reading and the vNRUMI.
[4]
4. Method according to claim 3, characterized in that the nucleotide edits comprise nucleotide substitutions, additions, and deletions.
[5]
5. Method according to claim 3, characterized by the fact that each alignment score penalizes mismatches at the beginning of a sequence but does not penalize mismatches at the end of the sequence.
[6]
6. Method according to claim 5, characterized by the fact that obtaining an alignment score between a reading and a vNRUMI comprises: (a) calculating an alignment score between the vNRUMI and each of all possible prefix sequences of the subsequence of the reading; (b) calculating an alignment score between the subsequence of the reading and each of all possible prefix sequences of the vNRUMI; and (c) obtaining the highest alignment score among the alignment scores calculated in (a) and (b) as the alignment score between the reading and the vNRUMI.
[7]
7. Method according to claim 2, characterized in that the subsequence has a length that is equal to a length of the longest vNRUMI in the set of vNRUMIs.
[8]
8. Method according to claim 2, characterized by the fact that the identification of the readings associated with the same vNRUMI in (d) additionally comprises: selecting, for each reading of the plurality of readings, at least one vNRUMI from the set of vNRUMIs based on the alignment scores; and associating each reading of the plurality of readings with the at least one vNRUMI selected for the reading.
[9]
9. Method according to claim 8, characterized by the fact that selecting at least one vNRUMI from the set of vNRUMIs comprises selecting a vNRUMI having the highest alignment score among the set of vNRUMIs.
[10]
10. Method according to claim 8, characterized in that the at least one vNRUMI comprises two or more vNRUMIs.
[11]
11. Method according to claim 10, characterized in that it additionally comprises selecting one of the two or more vNRUMIs as the same vNRUMI of (d) and (e).
[12]
12. Method according to claim 1, characterized by the fact that the adapters applied in (a) are obtained by: (i) providing a set of oligonucleotide sequences having at least two different molecular lengths; (ii) selecting a subset of oligonucleotide sequences from the set of oligonucleotide sequences, all edit distances between oligonucleotide sequences of the subset of oligonucleotide sequences meeting a threshold value, the subset of oligonucleotide sequences forming the set of vNRUMIs; and (iii) synthesizing the adapters, each comprising a hybridized double-stranded region, a single-stranded 5' arm, a single-stranded 3' arm, and at least one vNRUMI from the set of vNRUMIs.
[13]
13. Method according to claim 12, characterized by the fact that the threshold value is 3.
[14]
14. Method according to claim 1, characterized in that the set of vNRUMIs comprises 6-nucleotide vNRUMIs and 7-nucleotide vNRUMIs.
[15]
15. Method according to claim 1, characterized by the fact that (e) comprises collapsing the readings associated with the same vNRUMI into a group to obtain a consensus nucleotide sequence for the sequence of the DNA fragment in the sample.
[16]
16. Method according to claim 15, characterized by the fact that the consensus nucleotide sequence is obtained based partially on the quality scores of the readings.
[17]
17. Method according to claim 1, characterized by the fact that (e) comprises: identifying, among the readings associated with the same vNRUMI, readings having the same reading position or similar reading positions in a reference sequence, and determining the sequence of the DNA fragment using readings that (i) are associated with the same vNRUMI and (ii) have the same reading position or similar reading positions in the reference sequence.
[18]
18. Method according to claim 1, characterized by the fact that the set of vNRUMIs includes no more than about 10,000 different vNRUMIs.
[19]
19. Method according to claim 18, characterized by the fact that the set of vNRUMIs includes no more than about 1,000 different vNRUMIs.
[20]
20. Method according to claim 19, characterized in that the set of vNRUMIs includes no more than about 200 different vNRUMIs.
[21]
21. Method according to claim 1, characterized in that the application of adapters to the DNA fragments in the sample comprises applying adapters to both ends of the DNA fragments in the sample.
[22]
22. Method for preparing sequencing adapters, characterized by the fact that it comprises: (a) providing a set of oligonucleotide sequences having at least two different molecular lengths; (b) selecting a subset of oligonucleotide sequences from the set of oligonucleotide sequences, all edit distances between oligonucleotide sequences of the subset of oligonucleotide sequences meeting a threshold value, the subset of oligonucleotide sequences forming a set of unique non-random molecular indices of variable length (vNRUMIs); and (c) synthesizing a plurality of sequencing adapters, each sequencing adapter comprising a hybridized double-stranded region, a single-stranded 5' arm, a single-stranded 3' arm, and at least one vNRUMI from the set of vNRUMIs.
[23]
23. The method of claim 22, characterized by the fact that (b) comprises: (i) selecting an oligonucleotide sequence from the set of oligonucleotide sequences; (ii) adding the selected oligonucleotide sequence to an expansion set of oligonucleotide sequences and removing the selected oligonucleotide sequence from the set of oligonucleotide sequences to obtain a reduced set of oligonucleotide sequences; (iii) selecting a present oligonucleotide sequence from the reduced set that maximizes a distance function, wherein the distance function is a minimum edit distance between the present oligonucleotide sequence and any oligonucleotide sequences in the expansion set, and wherein the distance function meets the threshold value; (iv) adding the present oligonucleotide sequence to the expansion set and removing the present oligonucleotide sequence from the reduced set; (v) repeating (iii) and (iv) one or more times; and (vi) providing the expansion set as the subset of oligonucleotide sequences forming the set of vNRUMIs.
[24]
24. Method according to claim 23, characterized by the fact that (v) comprises repeating (iii) and (iv) until the distance function no longer meets the threshold value.
[25]
25. Method according to claim 23, characterized in that (v) comprises repeating (iii) and (iv) until the expansion set reaches a defined size.
[26]
26. The method of claim 23, wherein the present oligonucleotide sequence or an oligonucleotide sequence in the expansion set is shorter than the longest oligonucleotide sequence in the set of oligonucleotide sequences, the method characterized by the fact that it further comprises, before (iii), (1) appending a thymine base, or a thymine base plus any of the four bases, to the present oligonucleotide sequence or to the oligonucleotide sequence in the expansion set, thus generating a padded sequence having the same length as the longest oligonucleotide sequence in the set of oligonucleotide sequences, and (2) using the padded sequence to calculate the minimum edit distance.
[27]
27. Method according to claim 22, characterized by the fact that the edit distances are Levenshtein distances.
[28]
28. Method according to claim 22, characterized by the fact that the threshold value is three.
[29]
29. The method of claim 22, characterized in that it further comprises, prior to (b), removing certain oligonucleotide sequences from the set of oligonucleotide sequences to obtain a filtered set of oligonucleotide sequences; and providing the filtered set of oligonucleotide sequences as the set of oligonucleotide sequences from which the subset is selected.
[30]
30. The method of claim 29, characterized in that certain oligonucleotide sequences comprise oligonucleotide sequences having three or more consecutive identical bases.
[31]
31. The method of claim 29, characterized in that certain oligonucleotide sequences comprise oligonucleotide sequences having a combined number of guanine and cytosine bases less than 2 and oligonucleotide sequences having a combined number of guanine and cytosine bases greater than 4.
[32]
32. The method of claim 29, characterized in that certain oligonucleotide sequences comprise oligonucleotide sequences having the same base in the last two positions.
[33]
33. The method of claim 29, characterized in that certain oligonucleotide sequences comprise oligonucleotide sequences having a subsequence that matches the 3' end of one or more sequencing primers.
[34]
34. The method of claim 29, characterized in that certain oligonucleotide sequences comprise oligonucleotide sequences having a thymine base at the last position of the oligonucleotide sequences.
[35]
35. Method according to claim 22, characterized in that the set of vNRUMIs comprises 6-nucleotide vNRUMIs and 7-nucleotide vNRUMIs.
[36]
36. Method for sequencing nucleic acid molecules in a sample, characterized by the fact that it comprises: (a) applying adapters to DNA fragments in the sample to obtain DNA adapter products, wherein each adapter comprises a unique non-random molecular index, and wherein the unique non-random molecular indices of the adapters have at least two different molecular lengths and form a set of unique non-random molecular indices of variable length (vNRUMIs); (b) amplifying the DNA adapter products to obtain a plurality of amplified polynucleotides; (c) sequencing the plurality of amplified polynucleotides, thus obtaining a plurality of readings associated with the set of vNRUMIs; and (d) identifying, among the plurality of readings, readings associated with the same unique non-random molecular index of variable length (vNRUMI).
[37]
37. Method according to claim 36, characterized by the fact that it additionally comprises obtaining a count of the readings associated with the same vNRUMI.
[38]
38. Method for sequencing nucleic acid molecules in a sample, characterized by the fact that it comprises: (a) applying adapters to DNA fragments in the sample to obtain DNA adapter products, wherein each adapter comprises a unique molecular index (UMI), and wherein the unique molecular indices (UMIs) of the adapters have at least two different molecular lengths and form a set of unique molecular indices of variable length (vUMIs); (b) amplifying the DNA adapter products to obtain a plurality of amplified polynucleotides; (c) sequencing the plurality of amplified polynucleotides, thus obtaining a plurality of readings associated with the set of vUMIs; and (d) identifying, among the plurality of readings, readings associated with the same unique molecular index of variable length (vUMI).
[39]
39. Method according to claim 38, characterized in that it further comprises determining a sequence of a DNA fragment in the sample using the readings associated with the same vUMI.
[40]
40. Method according to claim 38, characterized by the fact that it additionally comprises obtaining a count of the readings associated with the same vUMI.
[41]
41. Method for sequencing nucleic acid molecules in a sample, characterized by the fact that it comprises: (a) applying adapters to DNA fragments in the sample to obtain DNA adapter products, wherein each adapter comprises a unique molecular index (UMI) in a set of unique molecular indices (UMIs); (b) amplifying the DNA adapter products to obtain a plurality of amplified polynucleotides; (c) sequencing the plurality of amplified polynucleotides, thus obtaining a plurality of readings associated with the set of UMIs; (d) obtaining, for each reading of the plurality of readings, alignment scores in relation to the set of UMIs, each alignment score indicating similarity between a subsequence of a reading and a UMI; (e) identifying, among the plurality of readings, readings associated with the same UMI using the alignment scores; and (f) determining a sequence of a DNA fragment in the sample using the readings associated with the same UMI.
[42]
42. Method according to claim 41, characterized in that the alignment scores are based on nucleotide matches and nucleotide edits between the subsequence of the reading and the UMI.
[43]
43. Method according to claim 42, characterized by the fact that each alignment score penalizes mismatches at the beginning of a sequence, but does not penalize mismatches at the end of the sequence.
[44]
44. Method according to claim 41, characterized in that the set of UMIs comprises UMIs of at least two different molecular lengths.
[45]
45. Computer program product, characterized by the fact that it comprises a non-transitory, machine-readable medium that stores program code that, when executed by one or more processors in a computer system, causes the computer system to implement a method for sequencing nucleic acid molecules in a sample, said program code comprising: (a) code for obtaining a plurality of readings from a plurality of amplified polynucleotides, each polynucleotide of the plurality of amplified polynucleotides comprising an adapter attached to a DNA fragment, wherein the adapter comprises a unique non-random molecular index, and wherein the unique non-random molecular indices of the adapters have at least two different molecular lengths, forming a set of unique non-random molecular indices of variable length (vNRUMIs); (b) code for identifying, among the plurality of readings, readings associated with the same vNRUMI; and (c) code for determining, using the readings associated with the same vNRUMI, a sequence of a DNA fragment in the sample.
[46]
46. Computer system, characterized by the fact that it comprises: one or more processors; system memory; and one or more computer-readable storage media having stored thereon computer-executable instructions that cause the computer system to implement a method for determining sequence information of a sequence of interest in a sample, the instructions comprising: (a) obtaining a plurality of readings from a plurality of amplified polynucleotides, each polynucleotide of the plurality of amplified polynucleotides comprising an adapter attached to a DNA fragment, wherein the adapter comprises a unique non-random molecular index, and wherein the unique non-random molecular indices of the adapters have at least two different molecular lengths, forming a set of unique non-random molecular indices of variable length (vNRUMIs); (b) identifying, among the plurality of readings, the readings associated with the same vNRUMI; and (c) determining, using the readings associated with the same vNRUMI, a sequence of a DNA fragment in the sample.
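The greedy selection recited in claims 23 to 25, together with some of the candidate filters of claims 30 to 34, might be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, "meets the threshold value" is read as "edit distance greater than or equal to the threshold", the seed is supplied by the caller, plain Levenshtein distance is used without the padding step of claim 26, and the primer filter of claim 33 is omitted because it needs primer sequences that are not given here.

```python
from itertools import product
from typing import Iterable, List, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (claim 27)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def passes_filters(seq: str) -> bool:
    """Apply the filters of claims 30, 31, 32 and 34 (claim 33 omitted)."""
    if any(seq[i] == seq[i + 1] == seq[i + 2] for i in range(len(seq) - 2)):
        return False                      # three or more identical bases in a row
    gc = sum(base in "GC" for base in seq)
    if gc < 2 or gc > 4:
        return False                      # combined G and C count outside 2..4
    if seq[-1] == seq[-2]:
        return False                      # same base in the last two positions
    return seq[-1] != "T"                 # thymine at the last position


def candidate_pool(lengths=(6, 7)) -> List[str]:
    """All filtered sequences of the given lengths (claim 14 uses 6 and 7)."""
    return [s for n in lengths
            for s in ("".join(p) for p in product("ACGT", repeat=n))
            if passes_filters(s)]


def greedy_select(candidates: Iterable[str], seed: str, threshold: int = 3,
                  max_size: Optional[int] = None) -> List[str]:
    """Grow an expansion set by repeatedly adding the farthest candidate."""
    pool = [c for c in candidates if c != seed]
    chosen = [seed]
    # min_dist[c] = smallest edit distance from c to any chosen sequence.
    min_dist = {c: levenshtein(c, seed) for c in pool}
    while pool and (max_size is None or len(chosen) < max_size):
        best = max(pool, key=lambda c: min_dist[c])
        if min_dist[best] < threshold:    # stopping rule of claim 24
            break
        chosen.append(best)
        pool.remove(best)
        for c in pool:
            min_dist[c] = min(min_dist[c], levenshtein(c, best))
    return chosen
```

For the toy pool `["ACGTAC", "ACGTAG", "TGCATGA"]` seeded with `"ACGTAC"`, the function returns `["ACGTAC", "TGCATGA"]`; `"ACGTAG"` is rejected because it lies one edit from the seed. Running the pure-Python loop over the full `candidate_pool()` is slow, so a practical run would cap `max_size` or use a faster distance routine.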

Similar technologies:

Publication No. | Publication date | Title

US10844429B2|2020-11-24|Methods and systems for generation and error-correction of unique molecular index sets with heterogeneous molecular lengths

AU2019250200B2|2021-10-14|Error Suppression In Sequenced DNA Fragments Using Redundant Reads With Unique Molecular Indices |

US20190085384A1|2019-03-21|Universal short adapters with variable length non-random unique molecular identifiers

US20180334712A1|2018-11-22|Universal short adapters for indexing of polynucleotide samples

US11028435B2|2021-06-08|Optimal index sequences for multiplex massively parallel sequencing

US10152569B2|2018-12-11|Algorithms for sequence determinations

BR112021006402A2|2021-09-21|SEQUENCE-GRAPH BASED TOOL TO DETERMINE VARIATION IN SHORT TANDEM REPETITION REGIONS

Patent family:

Publication No. | Publication date

CN110313034A|2019-10-08|

CA3050247A1|2018-07-26|

RU2019122349A3|2021-06-02|

RU2766198C2|2022-02-09|

AU2018210188A1|2019-08-01|

US20180201992A1|2018-07-19|

US10844429B2|2020-11-24|

WO2018136248A1|2018-07-26|

RU2019122349A|2021-02-19|

KR20190117529A|2019-10-16|

EP3571616B1|2021-05-19|

EP3889962A1|2021-10-06|

JP2020505947A|2020-02-27|

SG11201906428SA|2019-08-27|

US20210079462A1|2021-03-18|

EP3571616A1|2019-11-27|

Cited references:

Publication No. | Filing date | Publication date | Applicant | Title

US4683202B1|1985-03-28|1990-11-27|Cetus Corp|

US4683195B1|1986-01-30|1990-11-27|Cetus Corp|

CA2044616A1|1989-10-26|1991-04-27|Pepi Ross|Dna sequencing|

US5677170A|1994-03-02|1997-10-14|The Johns Hopkins University|In vitro transposition of artificial transposons|

ES2563643T3|1997-04-01|2016-03-15|Illumina Cambridge Limited|Nucleic acid sequencing method|

US6159736A|1998-09-23|2000-12-12|Wisconsin Alumni Research Foundation|Method for making insertional mutations using a Tn5 synaptic complex|

AR021833A1|1998-09-30|2002-08-07|Applied Research Systems|METHODS OF AMPLIFICATION AND SEQUENCING OF NUCLEIC ACID|

US20030064366A1|2000-07-07|2003-04-03|Susan Hardin|Real-time sequence determination|

EP1354064A2|2000-12-01|2003-10-22|Visigen Biotechnologies, Inc.|Enzymatic nucleic acid synthesis: compositions and methods for altering monomer incorporation fidelity|

US7057026B2|2001-12-04|2006-06-06|Solexa Limited|Labelled nucleotides|

SI3363809T1|2002-08-23|2020-08-31|Illumina Cambridge Limited|Modified nucleotides for polynucleotide sequencing|

US20040018520A1|2002-04-22|2004-01-29|James Thompson|Trans-splicing enzymatic nucleic acid mediated biopharmaceutical and protein|

WO2005065814A1|2004-01-07|2005-07-21|Solexa Limited|Modified molecular arrays|

US7302146B2|2004-09-17|2007-11-27|Pacific Biosciences Of California, Inc.|Apparatus and method for analysis of molecules|

WO2006064199A1|2004-12-13|2006-06-22|Solexa Limited|Improved method of nucleotide detection|

GB0514936D0|2005-07-20|2005-08-24|Solexa Ltd|Preparation of templates for nucleic acid sequencing|

US7405281B2|2005-09-29|2008-07-29|Pacific Biosciences Of California, Inc.|Fluorescent nucleotide analogs and uses therefor|

JP5122555B2|2006-03-31|2013-01-16|ソレクサ・インコーポレイテッド|Synthetic sequencing system and apparatus|

EP2089517A4|2006-10-23|2010-10-20|Pacific Biosciences California|Polymerase enzymes and reagents for enhanced nucleic acid sequencing|

US8262900B2|2006-12-14|2012-09-11|Life Technologies Corporation|Methods and apparatus for measuring analytes using large scale FET arrays|

WO2008093098A2|2007-02-02|2008-08-07|Illumina Cambridge Limited|Methods for indexing samples and sequencing multiple nucleotide templates|

US8932812B2|2009-12-17|2015-01-13|Keygene N.V.|Restriction enzyme based whole genome sequencing|

US9260745B2|2010-01-19|2016-02-16|Verinata Health, Inc.|Detecting and classifying copy number variation|

ES2534986T3|2010-01-19|2015-05-04|Verinata Health, Inc|Simultaneous determination of aneuploidy and fetal fraction|

WO2012040387A1|2010-09-24|2012-03-29|The Board Of Trustees Of The Leland Stanford Junior University|Direct capture, amplification and sequencing of target dna using immobilized primers|

PL2697397T3|2011-04-15|2017-08-31|The Johns Hopkins University|Safe sequencing system|

JP6028025B2|2011-07-08|2016-11-16|キージーン・エン・フェー|Sequence-based genotyping based on oligonucleotide ligation assays|

AU2012327251A1|2011-10-27|2013-05-23|Verinata Health, Inc.|Set membership testers for aligning nucleic acid samples|

WO2013138510A1|2012-03-13|2013-09-19|Patel Abhijit Ajit|Measurement of nucleic acid variants using highly-multiplexed error-suppressed deep sequencing|

CA2873585C|2012-05-14|2021-11-09|Cb Biotechnologies, Inc.|Method for increasing accuracy in quantitative detection of polynucleotides|

AU2013267609C1|2012-05-31|2019-01-03|Board Of Regents, The University Of Texas System|Method for accurate sequencing of DNA|

US20140024541A1|2012-07-17|2014-01-23|Counsyl, Inc.|Methods and compositions for high-throughput sequencing|

US10557133B2|2013-03-13|2020-02-11|Illumina, Inc.|Methods and compositions for nucleic acid sequencing|

US9328382B2|2013-03-15|2016-05-03|Complete Genomics, Inc.|Multiple tagging of individual long DNA fragments|

EP3421613B1|2013-03-15|2020-08-19|The Board of Trustees of the Leland Stanford Junior University|Identification and use of circulating nucleic acid tumor markers|

CN109599148A|2013-10-01|2019-04-09|考利达基因组股份有限公司|That identifies the variation in genome determines phase and connection method|

AU2014369841B2|2013-12-28|2019-01-24|Guardant Health, Inc.|Methods and systems for detecting genetic variants|

US9677132B2|2014-01-16|2017-06-13|Illumina, Inc.|Polynucleotide modification on solid support|

EP3191628A4|2014-09-12|2018-05-02|The Board of Trustees of the Leland Stanford Junior University|Identification and use of circulating nucleic acids|

WO2016168351A1|2015-04-15|2016-10-20|The Board Of Trustees Of The Leland Stanford Junior University|Robust quantification of single molecules in next-generation sequencing using non-random combinatorial oligonucleotide barcodes|

US10844428B2|2015-04-28|2020-11-24|Illumina, Inc.|Error suppression in sequenced DNA fragments using redundant reads with unique molecular indices |

US20170211140A1|2015-12-08|2017-07-27|Twinstrand Biosciences, Inc.|Adapters, methods, and compositions for duplex sequencing|

US20170355984A1|2016-06-10|2017-12-14|Counsyl, Inc.|Nucleic acid sequencing adapters and uses thereof|

US20190085384A1|2017-09-15|2019-03-21|Illumina, Inc.|Universal short adapters with variable length non-random unique molecular identifiers|

PL2697397T3|2011-04-15|2017-08-31|The Johns Hopkins University|Safe sequencing system|

US10844428B2|2015-04-28|2020-11-24|Illumina, Inc.|Error suppression in sequenced DNA fragments using redundant reads with unique molecular indices |

WO2019067092A1|2017-08-07|2019-04-04|The Johns Hopkins University|Methods and materials for assessing and treating cancer|

EP3844497A2|2018-08-28|2021-07-07|F. Hoffmann-La Roche AG|Nanopore sequencing device comprising ruthenium-containing electrodes|

AU2019403077A1|2018-12-19|2021-06-17|F. Hoffmann-La Roche Ag|3' protected nucleotides|

WO2020136133A1|2018-12-23|2020-07-02|F. Hoffmann-La Roche Ag|Tumor classification based on predicted tumor mutational burden|

US11210554B2|2019-03-21|2021-12-28|Illumina, Inc.|Artificial intelligence-based generation of sequencing metadata|

EP3836148A1|2019-12-09|2021-06-16|Lexogen GmbH|Index sequences for multiplex parallel sequencing|

WO2021158989A1|2020-02-07|2021-08-12|Lodo Therapeutics Corporation|Methods and apparatus for efficient and accurate assembly of long-read genomic sequences|

WO2022010965A1|2020-07-08|2022-01-13|Illumina, Inc.|Beads as transposome carriers|

WO2022031955A1|2020-08-06|2022-02-10|Illumina, Inc.|Preparation of rna and dna sequencing libraries using bead-linked transposomes|

CN111968706B|2020-10-20|2021-02-12|安诺优达基因科技有限公司|Method for obtaining target sequencing data of target sample and method for assembling sequence of target sample|

Legal status:
2021-11-03| B350| Update of information on the portal [chapter 15.35 patent gazette]|

优先权:

申请号 | 申请日 | 专利标题

US201762447851P| true| 2017-01-18|2017-01-18|

US62/447851|2017-01-18|

PCT/US2018/012669|WO2018136248A1|2017-01-18|2018-01-05|Methods and systems for generation and error-correction of unique molecular index sets with heterogeneous molecular lengths|
